> On Nov. 25, 2013, 8:07 p.m., Hari Shreedharan wrote:
> > Juhani,
> >
> > I don't mind the idea of this patch, but I am not sure this works. If the
> > flush fails, we ignore the failure and set isOpen to false, so we are not
> > going to try to close the file again (barring retries, which are possible in
> > newer versions of Hadoop and Flume). If we don't try to close or flush
> > again, the data could still be stuck in the local buffers and we risk data
> > loss. The real issue is that there is some underlying failure in HDFS,
> > which should be handled.
> >
> > I think the real solution is to retry closes, which we added in Flume a few
> > months back, provided that version of HDFS supports it. Does that make
> > sense?
If you look at the logs I attached to the JIRA, you will note that certain lines
are missing: for example, the output from
LOG.warn("failed to close() HDFSWriter for file (" + bucketPath + "). Exception
follows.", e);
I'm pretty confident this is because the IOException is occurring in the flush()
at the beginning of BucketWriter.close(). close() is called after
writer.append() fails, and that append failure does show up with the appropriate
log output. There is no retry logic surrounding the flush, so the exception
occurs but none of the close-retry logic is ever entered (the retry logic only
comes in after the flush).
I've annotated the most relevant code in BucketWriter.append() and close() below;
these are from the version we have installed. I compared it to the most recent
source, and based on the log output it is not entering the code path where the
close retries occur.
// write the event
try {
  sinkCounter.incrementEventDrainAttemptCount();
  callWithTimeout(new CallRunner<Void>() {
    @Override
    public Void call() throws Exception {
      writer.append(event); // could block  <-- this is the call that throws the exception
      return null;
    }
  });
} catch (IOException e) {
  LOG.warn("Caught IOException writing to HDFSWriter ({}). Closing file (" +
      bucketPath + ") and rethrowing exception.",
      e.getMessage());
  try {
    close(); // <-- this fails at the flush(), returning the exception cause
             //     in the executor thread
  } catch (IOException e2) {
    LOG.warn("Caught IOException while closing file (" +
        bucketPath + "). Exception follows.", e2);
  }
  throw e;
}
==========================================================
public synchronized void close() throws IOException, InterruptedException {
  checkAndThrowInterruptedException();
  flush(); // <-- this is where the failure occurs
  LOG.debug("Closing {}", bucketPath);
  if (isOpen) {
    try { // <-- beyond here is where newer versions add the close retries; if they
          //     don't get past the flush, that logic never gets a chance to kick in
      callWithTimeout(new CallRunner<Void>() {
        @Override
        public Void call() throws Exception {
          writer.close(); // could block
          return null;
        }
      });
      // ... (rest of close() elided)
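
To make the intent of the patch concrete, the change is roughly along these
lines: tolerate a failure in that initial flush() so execution still reaches the
code that closes the underlying writer (and, in newer versions, the close
retries). This is only a sketch of the shape of the change, not the exact diff,
and the warning message here is just illustrative:

public synchronized void close() throws IOException, InterruptedException {
  checkAndThrowInterruptedException();
  try {
    flush();
  } catch (IOException e) {
    // Don't abort here: log the failure and keep going so the underlying stream
    // still gets closed and reopened instead of appends failing until a restart.
    LOG.warn("Pre-close flush on (" + bucketPath + ") failed. Exception follows.", e);
  }
  LOG.debug("Closing {}", bucketPath);
  if (isOpen) {
    // ... unchanged from the excerpt above: writer.close() via callWithTimeout,
    // which is where the close retries in newer versions get their chance to run.
  }
}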
- Juhani
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15779/#review29389
-----------------------------------------------------------
On Nov. 22, 2013, 8:38 a.m., Juhani Connolly wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15779/
> -----------------------------------------------------------
>
> (Updated Nov. 22, 2013, 8:38 a.m.)
>
>
> Review request for Flume.
>
>
> Repository: flume-git
>
>
> Description
> -------
>
> https://issues.apache.org/jira/browse/FLUME-2245
>
> Originally the flush() seemed superfluous; however, without it one of the unit
> tests breaks.
>
> By moving on regardless of whether the flush succeeds, we allow the backing
> stream to actually get closed and reopened. While the real problem is the HDFS
> stream not recovering, this workaround seems necessary, as otherwise appends
> will continue to fail until a restart.
>
> Similarly, HDFSDataStream and HDFSCompressedDataStream are closed regardless
> of the success of serialization/flushing. The exception should be propagated
> and cause a rollback so no data loss occurs.
>
>
> Diffs
> -----
>
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
> 200d457
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSCompressedDataStream.java
> 5518547
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSDataStream.java
> e20d1ee
>
> Diff: https://reviews.apache.org/r/15779/diff/
>
>
> Testing
> -------
>
> Existing unit tests pass.
>
> I'm still trying to figure out a way to reproduce the issue, as it is hard to
> determine the exact cause.
>
>
> Thanks,
>
> Juhani Connolly
>
>