[
https://issues.apache.org/jira/browse/FLUME-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chenshangan updated FLUME-2353:
-------------------------------
Description:
Sometimes a .tmp file can lose a block in HDFS, after which the HDFSWriter can
no longer write events, flush, or close the file; it just keeps retrying and
catching the same IOException.
The error stack is as follows:
06 Aug 2013 04:27:08,853 WARN [DataStreamer for file
**************************.1375732802628.lzo.tmp block
blk_709795560527813415_25801594]
(org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159) -
Error Recovery for block blk_709795560527813415_25801594 failed because
recovery from primary datanode *****:50010 failed 1 times. Pipeline was
******:50010,******:50010, ******:50010. Will retry...
06 Aug 2013 04:27:08,990 WARN [DataStreamer for file
**************************.1375732802628.lzo.tmp block
blk_709795560527813415_25801594]
(org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159) -
Error Recovery for block blk_709795560527813415_25801594 failed because
recovery from primary datanode ******:50010 failed 2 times. Pipeline was
******:50010,******:50010, ******:50010. Will retry...
…
06 Aug 2013 04:27:50,694 WARN [DataStreamer for file
**************************.1375732802628.lzo.tmp block
blk_709795560527813415_25801594]
(org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3139) -
Error Recovery for block blk_709795560527813415_25801594 failed because
recovery from primary datanode ******:50010 failed 6 times. Pipeline was
******:50010,******:50010, ******:50010. Marking primary datanode as bad.
06 Aug 2013 04:30:40,365 WARN [SinkRunner-PollingRunner-FailoverSinkProcessor]
(org.apache.flume.sink.hdfs.HDFSEventSink.process:418) - HDFS IO error
java.io.IOException: Error Recovery for block blk_709795560527813415_25801594
failed because recovery from primary datanode ********:50010 failed 6 times.
Pipeline was *****:50010. Aborting...
DFSClient retries recovery of the missing block up to a maximum number of
times and finally throws an IOException. HDFSWriter rethrows the exception to
HDFSEventSink, which rolls back the transaction and hits the same error on the
next attempt, so the sink is stuck in a dead loop.
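To make the dead loop concrete, here is a self-contained sketch (the class
and names are illustrative, not Flume's actual code): once block recovery is
exhausted every close() throws, the sink effectively rolls back, and the next
poll hits the same failure.
{code:java}
import java.io.IOException;

// Illustrative only: BrokenWriter stands in for an HDFSWriter whose block
// recovery has been exhausted, so every close() throws the same IOException.
public class DeadLoopSketch {

    static class BrokenWriter {
        void close() throws IOException {
            throw new IOException(
                "Error Recovery for block ... failed 6 times. Aborting...");
        }
    }

    public static void main(String[] args) {
        BrokenWriter writer = new BrokenWriter();
        int attempts = 0;
        // SinkRunner keeps invoking the sink; each attempt catches the
        // IOException, rolls back, and tries again on the next poll.
        while (attempts < 3) { // capped for the demo; the real sink never stops
            try {
                writer.close();
                return; // never reached: close() always fails
            } catch (IOException e) {
                attempts++;
                System.err.println("attempt " + attempts + ": " + e.getMessage());
            }
        }
        System.err.println("(in the real sink this loop never terminates)");
    }
}
{code}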
My suggestion and solution is to add a graceClose() method: if closing has
failed too many times, give up and just leave the .tmp file alone.
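A minimal sketch of what graceClose() could look like (field and method names
here are hypothetical, not the real BucketWriter API): rethrow while attempts
remain so HDFSEventSink can roll back and retry as it does today, but once a
configurable limit is hit, log the error and return normally, abandoning the
.tmp file.
{code:java}
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch of the proposed graceClose(); not Flume's actual API.
public class GraceCloseSketch {
    private final Closeable writer;   // stands in for the HDFSWriter
    private final String bucketPath;  // the in-progress .tmp file
    private final int maxCloseTries;  // hypothetical give-up threshold
    private int closeTries = 0;

    public GraceCloseSketch(Closeable writer, String bucketPath, int maxCloseTries) {
        this.writer = writer;
        this.bucketPath = bucketPath;
        this.maxCloseTries = maxCloseTries;
    }

    // While attempts remain, rethrow so the sink can roll back and retry;
    // once the limit is reached, log and return normally, leaving the
    // .tmp file alone so the sink can open a new file and keep going.
    public void graceClose() throws IOException {
        try {
            writer.close();
            closeTries = 0;
        } catch (IOException e) {
            closeTries++;
            if (closeTries < maxCloseTries) {
                throw e;
            }
            closeTries = 0;
            System.err.println("Giving up on " + bucketPath + " after "
                + maxCloseTries + " failed close attempts: " + e);
        }
    }
}
{code}
Whether the abandoned .tmp file should later be renamed or cleaned up is a
separate question; the point of the sketch is only that the sink stops
blocking on a file HDFS can no longer close.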
> BucketWriter throws IOException endlessly when it fails to close a file
> ------------------------------------------------------------------------
>
> Key: FLUME-2353
> URL: https://issues.apache.org/jira/browse/FLUME-2353
> Project: Flume
> Issue Type: Improvement
> Reporter: chenshangan
> Assignee: chenshangan
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)