[ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676355#action_12676355 ]
Brian Bockelman commented on HADOOP-4584:
-----------------------------------------

Hey Raghu,

- Regarding your point above about periodic block verification handling the various things that can go wrong with a block: currently, it's woefully insufficient, especially on large data nodes, to replace the directory scan. If we wait 3 weeks (or several months for some of our large nodes) before we find a block is missing, we're going to see lots and lots of issues crop up!
- I have seen the 'rm -r' in practice, by the way :).
- With a reasonably sized block, we've had 48TB servers be able to only take a few minutes for a scan: no heartbeats lost.

That said, I do like your argument that the DN should handle things to the best of its abilities and not die. I like the idea of the patch, but only if it's combined with an occasional offline scan (even once a day!). Creeping inconsistency bugs in the NN seem to make very accurate block reports a precious commodity, one that I'd gladly pay an expensive scan for (though I agree that once an hour is probably excessive).

> Slow generation of blockReport at DataNode causes delay of sending heartbeat
> to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.patch, 4584.patch, 4584.patch, 4584.patch,
> 4584.patch, 4584.patch
>
>
> Sometimes, due to disk or some other problems, a datanode takes minutes or tens
> of minutes to generate a block report. This leaves the datanode unable to
> send a heartbeat to the NameNode every 3 seconds. In the worst case, the
> NameNode detects a lost heartbeat and wrongly decides that the datanode is
> dead.
> It would be nice to have two threads instead. One thread is for scanning data
> directories and generating block report, and executes the requests sent by
> NameNode; another thread is for sending heartbeats, block reports, and
> picking up the requests from NameNode. By having these two threads, the
> sending of heartbeats will not get delayed by any slow block report or slow
> execution of NameNode requests.
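
For illustration, the two-thread split proposed in the issue description could be sketched in Java roughly as follows. This is only a sketch under stated assumptions, not the code in the attached patches: the names TwoThreadSketch, scanDataDirectories and run, the in-memory "sent" log standing in for RPCs to the NameNode, and the sleep standing in for a slow directory scan are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from Hadoop or the 4584 patches.
final class TwoThreadSketch {

    // Stand-in for a slow directory scan that builds a block report.
    static long[] scanDataDirectories(long scanMillis) throws InterruptedException {
        Thread.sleep(scanMillis);        // simulates slow disks
        return new long[] {1L, 2L, 3L};  // placeholder block IDs
    }

    // Runs one scan and returns, in order, everything "sent" to the NameNode.
    static List<String> run(long heartbeatMillis, long scanMillis) {
        List<String> sent = new CopyOnWriteArrayList<>();
        CountDownLatch reportSent = new CountDownLatch(1);
        ExecutorService scanner = Executors.newSingleThreadExecutor();
        ScheduledExecutorService sender = Executors.newSingleThreadScheduledExecutor();
        try {
            // Thread 1: scans the data directories; it never talks to the NameNode.
            Future<long[]> report = scanner.submit(() -> scanDataDirectories(scanMillis));

            // Thread 2: heartbeats on a fixed schedule and forwards the report
            // once the scanner has finished, so a slow scan cannot delay it.
            sender.scheduleAtFixedRate(() -> {
                sent.add("heartbeat");
                if (reportSent.getCount() > 0 && report.isDone()) {
                    try {
                        sent.add("blockReport:" + report.get().length);
                    } catch (Exception e) {
                        sent.add("blockReportFailed");
                    } finally {
                        reportSent.countDown();
                    }
                }
            }, 0, heartbeatMillis, TimeUnit.MILLISECONDS);

            // Wait (bounded) until the sender thread has forwarded the report.
            reportSent.await(scanMillis + 5000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            sender.shutdownNow();
            scanner.shutdownNow();
        }
        return new ArrayList<>(sent);
    }

    public static void main(String[] args) {
        List<String> log = run(20, 300);
        System.out.println(log.indexOf("blockReport:3") + " heartbeats sent before the block report");
    }
}
```

With run(20, 300), heartbeats keep going out every 20 ms while the simulated 300 ms scan is still in progress, and the sender thread forwards the report on the first tick after the scan completes. The sketch leaves out the other half of the proposal (executing requests picked up from the NameNode on the scanning thread).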