> On Nov. 25, 2013, 8:07 p.m., Hari Shreedharan wrote:
> > Juhani,
> > 
> > I don't mind the idea of this patch, but I am not sure this works. If the 
> > flush fails, we ignore the failure and we set isOpen to false, we are not 
> > going to try to close the file again (barring retries which is possible in 
> > newer versions of hadoop and flume). If we don't try to close or flush 
> > again, the data could still be stuck in the local buffers and we risk data 
> > loss. The real issue is that there is some underlying failure in HDFS, 
> > which should be handled. 
> > 
> > I think the real solution is to retry closes, which we added in Flume a few 
> > months back, provided that version of HDFS supports it. Does that make 
> > sense?
> 
> Juhani Connolly wrote:
>     If you look at the logs I attached to the JIRA you will note that certain 
> lines are missing: for example the output from
>     
>     LOG.warn("failed to close() HDFSWriter for file (" + bucketPath + "). 
> Exception follows.", e); 
>     
>     I'm pretty confident this is because the IOException is occuring in the 
> flush() at the beginning of BucketWriter.close(). The close is called after 
> the writer.append() fails. This also fails with the appropriate logs getting 
> put out. There is no retry logic surrounding this, so the exception occurs 
> but none of the retry logic is entered(the retry logic comes in after the 
> flush)
>     
>     I've annotated the most relevant code in BucketWriter.append() and 
> close() these  are from the version we had installed. I compared to the most 
> recent source and based on the log output it's not entering the code path 
> where the close retries occur.
>     
>         // write the event
>         try {
>           sinkCounter.incrementEventDrainAttemptCount();
>           callWithTimeout(new CallRunner<Void>() {
>             @Override
>             public Void call() throws Exception {
>               writer.append(event); // could block  <-- this throws the 
> exception
>               return null;
>             }
>           });
>         } catch (IOException e) {
>           LOG.warn("Caught IOException writing to HDFSWriter ({}). Closing 
> file (" +
>               bucketPath + ") and rethrowing exception.",
>               e.getMessage());
>           try {
>             close();                           <-- this fails at the flush 
> returning the exception cause in the executor thread
>           } catch (IOException e2) {
>             LOG.warn("Caught IOException while closing file (" +
>                  bucketPath + "). Exception follows.", e2);
>           }
>           throw e;
>         }
>     
>     ==========================================================
>     
>       public synchronized void close() throws IOException, 
> InterruptedException {
>         checkAndThrowInterruptedException();
>         flush();   <-- this is where the failure occurs
>         LOG.debug("Closing {}", bucketPath);
>         if (isOpen) {
>           try {                     <-------- beyond here is where newer 
> versions add in the close retries. If they don't get past the flush that 
> logic  doesn't get a chance to kick in
>             callWithTimeout(new CallRunner<Void>() { 
>               @Override
>               public Void call() throws Exception {
>                 writer.close(); // could block
>                 return null;
>               }
>             });
> 
> Juhani Connolly wrote:
>     The close retries doesn't work for this case.
>     
>     I updated our build to the latest snapshot(with the exception of undoing 
> the protoc upgrade since we run on cdh 4.3, soon to be 4.5 which still uses 
> older protoc) about a week ago, and we got the same error.
>     
>     The issue is that it fails in the flush before it enters the try/catch 
> block. If it happens there, it never gets retried. I've put together another 
> build that we're testing locally, and I'll upload that once I'm confident the 
> issue is no longer happening(unfortunately it can take a while to happen)

So adding retries in all cases should handle it. This is a more general problem 
really, if close fails, presumably due to a transient HDFS issue the retries 
happen almost immediately, when HDFS might still be in a bad state - and the 
closes might still fail. We might want to fix this once and for all by putting 
this in a queue (failed close queue or something) and retry them periodically.


- Hari


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15779/#review29389
-----------------------------------------------------------


On Nov. 22, 2013, 8:38 a.m., Juhani Connolly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15779/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 8:38 a.m.)
> 
> 
> Review request for Flume.
> 
> 
> Repository: flume-git
> 
> 
> Description
> -------
> 
> https://issues.apache.org/jira/browse/FLUME-2245
> 
> Originally the flush() seemed superfluous however without it one of the unit 
> tests breaks.
> 
> By moving on beyond regardless of the flush succeeding or not we allow the 
> backing stream to actually get closed and reopened. While the real problem is 
> with the HDFS stream not recovering this workaround seems necessary as 
> otherwise appends will continue to fail until a restart.
> 
> Similarly HDFSDataStream and HDFSCompressedDataStream are closed regardless 
> of the success of serialization/flushing. The exception should be propagated 
> and cause a rollback so no data loss occurs.
> 
> 
> Diffs
> -----
> 
>   
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
>  200d457 
>   
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSCompressedDataStream.java
>  5518547 
>   
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSDataStream.java
>  e20d1ee 
> 
> Diff: https://reviews.apache.org/r/15779/diff/
> 
> 
> Testing
> -------
> 
> Existing unit tests pass.
> 
> I'm still trying to figure out a way to recreate the issue as it is hard to 
> determine the exact cause
> 
> 
> Thanks,
> 
> Juhani Connolly
> 
>

Reply via email to