> On Nov. 25, 2013, 8:07 p.m., Hari Shreedharan wrote:
> > Juhani,
> >
> > I don't mind the idea of this patch, but I am not sure this works. If the
> > flush fails, we ignore the failure and we set isOpen to false, we are not
> > going to try to close the file again (barring retries which is possible in
> > newer versions of hadoop and flume). If we don't try to close or flush
> > again, the data could still be stuck in the local buffers and we risk data
> > loss. The real issue is that there is some underlying failure in HDFS,
> > which should be handled.
> >
> > I think the real solution is to retry closes, which we added in Flume a few
> > months back, provided that version of HDFS supports it. Does that make
> > sense?
>
> Juhani Connolly wrote:
> If you look at the logs I attached to the JIRA you will note that certain
> lines are missing: for example the output from
>
> LOG.warn("failed to close() HDFSWriter for file (" + bucketPath + ").
> Exception follows.", e);
>
> I'm pretty confident this is because the IOException is occuring in the
> flush() at the beginning of BucketWriter.close(). The close is called after
> the writer.append() fails. This also fails with the appropriate logs getting
> put out. There is no retry logic surrounding this, so the exception occurs
> but none of the retry logic is entered(the retry logic comes in after the
> flush)
>
> I've annotated the most relevant code in BucketWriter.append() and
> close() these are from the version we had installed. I compared to the most
> recent source and based on the log output it's not entering the code path
> where the close retries occur.
>
> // write the event
> try {
> sinkCounter.incrementEventDrainAttemptCount();
> callWithTimeout(new CallRunner<Void>() {
> @Override
> public Void call() throws Exception {
> writer.append(event); // could block <-- this throws the
> exception
> return null;
> }
> });
> } catch (IOException e) {
> LOG.warn("Caught IOException writing to HDFSWriter ({}). Closing
> file (" +
> bucketPath + ") and rethrowing exception.",
> e.getMessage());
> try {
> close(); <-- this fails at the flush
> returning the exception cause in the executor thread
> } catch (IOException e2) {
> LOG.warn("Caught IOException while closing file (" +
> bucketPath + "). Exception follows.", e2);
> }
> throw e;
> }
>
> ==========================================================
>
> public synchronized void close() throws IOException,
> InterruptedException {
> checkAndThrowInterruptedException();
> flush(); <-- this is where the failure occurs
> LOG.debug("Closing {}", bucketPath);
> if (isOpen) {
> try { <-------- beyond here is where newer
> versions add in the close retries. If they don't get past the flush that
> logic doesn't get a chance to kick in
> callWithTimeout(new CallRunner<Void>() {
> @Override
> public Void call() throws Exception {
> writer.close(); // could block
> return null;
> }
> });
>
> Juhani Connolly wrote:
> The close retries doesn't work for this case.
>
> I updated our build to the latest snapshot(with the exception of undoing
> the protoc upgrade since we run on cdh 4.3, soon to be 4.5 which still uses
> older protoc) about a week ago, and we got the same error.
>
> The issue is that it fails in the flush before it enters the try/catch
> block. If it happens there, it never gets retried. I've put together another
> build that we're testing locally, and I'll upload that once I'm confident the
> issue is no longer happening(unfortunately it can take a while to happen)
So adding retries in all cases should handle it. This is a more general problem
really, if close fails, presumably due to a transient HDFS issue the retries
happen almost immediately, when HDFS might still be in a bad state - and the
closes might still fail. We might want to fix this once and for all by putting
this in a queue (failed close queue or something) and retry them periodically.
- Hari
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15779/#review29389
-----------------------------------------------------------
On Nov. 22, 2013, 8:38 a.m., Juhani Connolly wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15779/
> -----------------------------------------------------------
>
> (Updated Nov. 22, 2013, 8:38 a.m.)
>
>
> Review request for Flume.
>
>
> Repository: flume-git
>
>
> Description
> -------
>
> https://issues.apache.org/jira/browse/FLUME-2245
>
> Originally the flush() seemed superfluous however without it one of the unit
> tests breaks.
>
> By moving on beyond regardless of the flush succeeding or not we allow the
> backing stream to actually get closed and reopened. While the real problem is
> with the HDFS stream not recovering this workaround seems necessary as
> otherwise appends will continue to fail until a restart.
>
> Similarly HDFSDataStream and HDFSCompressedDataStream are closed regardless
> of the success of serialization/flushing. The exception should be propagated
> and cause a rollback so no data loss occurs.
>
>
> Diffs
> -----
>
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
> 200d457
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSCompressedDataStream.java
> 5518547
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSDataStream.java
> e20d1ee
>
> Diff: https://reviews.apache.org/r/15779/diff/
>
>
> Testing
> -------
>
> Existing unit tests pass.
>
> I'm still trying to figure out a way to recreate the issue as it is hard to
> determine the exact cause
>
>
> Thanks,
>
> Juhani Connolly
>
>