[
https://issues.apache.org/jira/browse/FLUME-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chenshangan updated FLUME-2353:
-------------------------------
Description:
Sometimes a .tmp file can lose a block in HDFS, after which the HDFSWriter can
no longer write events, flush, or close the file; it just keeps retrying and
catching the same IOException.
The error stack is as follows:
06 Aug 2013 04:27:08,853 WARN [DataStreamer for file
**************************.1375732802628.lzo.tmp block
blk_709795560527813415_25801594]
(org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159) -
Error Recovery for block blk_709795560527813415_25801594 failed because
recovery from primary datanode *****:50010 failed 1 times. Pipeline was
******:50010,******:50010, ******:50010. Will retry...
06 Aug 2013 04:27:08,990 WARN [DataStreamer for file
**************************.1375732802628.lzo.tmp block
blk_709795560527813415_25801594]
(org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159) -
Error Recovery for block blk_709795560527813415_25801594 failed because
recovery from primary datanode ******:50010 failed 2 times. Pipeline was
******:50010,******:50010, ******:50010. Will retry...
…
06 Aug 2013 04:27:50,694 WARN [DataStreamer for file
**************************.1375732802628.lzo.tmp block
blk_709795560527813415_25801594]
(org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3139) -
Error Recovery for block blk_709795560527813415_25801594 failed because
recovery from primary datanode ******:50010 failed 6 times. Pipeline was
******:50010,******:50010, ******:50010. Marking primary datanode as bad.
06 Aug 2013 04:30:40,365 WARN [SinkRunner-PollingRunner-FailoverSinkProcessor]
(org.apache.flume.sink.hdfs.HDFSEventSink.process:418) - HDFS IO error
java.io.IOException: Error Recovery for block blk_709795560527813415_25801594
failed because recovery from primary datanode ********:50010 failed 6 times.
Pipeline was *****:50010. Aborting...
DFSClient retries recovery of the missing block up to a maximum number of
times and finally throws an IOException. HDFSWriter rethrows the exception to
HDFSEventSink, which rolls back the transaction and hits the same error on the
next attempt, so the sink is stuck in a dead loop.
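To make the dead loop concrete, here is a self-contained sketch (the class
and names are illustrative, not Flume's actual code): once block recovery is
exhausted every close() throws, the sink effectively rolls back, and the next
poll hits the same failure.
{code:java}
import java.io.IOException;

// Illustrative only: BrokenWriter stands in for an HDFSWriter whose block
// recovery has been exhausted, so every close() throws the same IOException.
public class DeadLoopSketch {

    static class BrokenWriter {
        void close() throws IOException {
            throw new IOException(
                "Error Recovery for block ... failed 6 times. Aborting...");
        }
    }

    public static void main(String[] args) {
        BrokenWriter writer = new BrokenWriter();
        int attempts = 0;
        // SinkRunner keeps invoking the sink; each attempt catches the
        // IOException, rolls back, and tries again on the next poll.
        while (attempts < 3) { // capped for the demo; the real sink never stops
            try {
                writer.close();
                return; // never reached: close() always fails
            } catch (IOException e) {
                attempts++;
                System.err.println("attempt " + attempts + ": " + e.getMessage());
            }
        }
        System.err.println("(in the real sink this loop never terminates)");
    }
}
{code}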
My suggestion and solution is to add a graceClose() method: if closing has
failed too many times, give up and just leave the .tmp file alone.
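A minimal sketch of what graceClose() could look like (field and method names
here are hypothetical, not the real BucketWriter API): rethrow while attempts
remain so HDFSEventSink can roll back and retry as it does today, but once a
configurable limit is hit, log the error and return normally, abandoning the
.tmp file.
{code:java}
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch of the proposed graceClose(); not Flume's actual API.
public class GraceCloseSketch {
    private final Closeable writer;   // stands in for the HDFSWriter
    private final String bucketPath;  // the in-progress .tmp file
    private final int maxCloseTries;  // hypothetical give-up threshold
    private int closeTries = 0;

    public GraceCloseSketch(Closeable writer, String bucketPath, int maxCloseTries) {
        this.writer = writer;
        this.bucketPath = bucketPath;
        this.maxCloseTries = maxCloseTries;
    }

    // While attempts remain, rethrow so the sink can roll back and retry;
    // once the limit is reached, log and return normally, leaving the
    // .tmp file alone so the sink can open a new file and keep going.
    public void graceClose() throws IOException {
        try {
            writer.close();
            closeTries = 0;
        } catch (IOException e) {
            closeTries++;
            if (closeTries < maxCloseTries) {
                throw e;
            }
            closeTries = 0;
            System.err.println("Giving up on " + bucketPath + " after "
                + maxCloseTries + " failed close attempts: " + e);
        }
    }
}
{code}
Whether the abandoned .tmp file should later be renamed or cleaned up is a
separate question; the point of the sketch is only that the sink stops
blocking on a file HDFS can no longer close.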
> BucketWriter throws IOException endlessly when it fails to close a file
> ------------------------------------------------------------------------
>
> Key: FLUME-2353
> URL: https://issues.apache.org/jira/browse/FLUME-2353
> Project: Flume
> Issue Type: Improvement
> Reporter: chenshangan
> Assignee: chenshangan
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)