[ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680343#action_12680343 ]
Suresh Srinivas commented on HADOOP-4584:
-----------------------------------------

Based on the discussions so far, here is a proposal:
# DataBlockScanner will be enhanced to periodically check whether the blocks on the disk match the blocks in memory.
# A block list is compiled from the disk and from the in-memory map. The two lists are compared to find the following inconsistencies:
## Block is in memory but not on the disk
## Block is on the disk but not in memory
## Block on the disk does not match the block in memory
# Differences are reconciled one at a time. The FSDataset lock is held to prevent further block changes, and a check is done to ensure the inconsistency still exists (to account for changes that might have happened while the disk was being checked for block files):
## If a block file is missing on the disk, the block is deleted in memory
## If a block metadata file is missing on the disk, the block in memory is updated with a generation stamp of zero (as done in block reports currently)
## If a block is missing in memory, it is added to FSDataset
## If the blocks do not match, the in-memory block is updated to reflect the block on the disk
## A block metafile that does not have a corresponding block file is deleted from the disk
# The block report will be generated from the in-memory data

> Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.hbthread.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens of minutes to generate a block report. This prevents the datanode from sending a heartbeat to the NameNode every 3 seconds. In the worst case, the NameNode detects a lost heartbeat and wrongly decides that the datanode is dead.
> It would be nice to have two threads instead. One thread scans the data directories, generates the block report, and executes the requests sent by the NameNode; another thread sends heartbeats and block reports and picks up the requests from the NameNode. With these two threads, the sending of heartbeats will not be delayed by a slow block report or by slow execution of NameNode requests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
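
To make the reconciliation steps proposed in the comment above concrete, here is a minimal sketch in Java. It is not taken from any of the attached patches. Every name in it (BlockReconcileSketch, DiskEntry, MemEntry, findSuspects, reconcileOne) is hypothetical, the two maps are stand-ins for the data directories and for the FSDataset block map, and a synchronized method stands in for the FSDataset lock. It only illustrates the "compare without the lock, then re-check and fix each difference under the lock" idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch only; these are not the real DataNode classes. */
class BlockReconcileSketch {
    /** What a directory scan found for one block ID. */
    static final class DiskEntry {
        final boolean hasBlockFile, hasMetaFile;
        final long length, genStamp;
        DiskEntry(boolean hasBlockFile, boolean hasMetaFile, long length, long genStamp) {
            this.hasBlockFile = hasBlockFile; this.hasMetaFile = hasMetaFile;
            this.length = length; this.genStamp = genStamp;
        }
    }

    /** The in-memory record for one block. */
    static final class MemEntry {
        final long length, genStamp;
        MemEntry(long length, long genStamp) { this.length = length; this.genStamp = genStamp; }
    }

    final Map<Long, MemEntry> memory = new HashMap<>(); // stand-in for the in-memory block map
    final Map<Long, DiskEntry> disk = new HashMap<>();  // stand-in for the files on disk

    /** The in-memory record the disk state implies, or null if no block file exists. */
    static MemEntry expected(DiskEntry d) {
        if (d == null || !d.hasBlockFile) return null;
        // A missing meta file is recorded as generation stamp zero.
        return new MemEntry(d.length, d.hasMetaFile ? d.genStamp : 0L);
    }

    static boolean consistent(MemEntry m, DiskEntry d) {
        if (d != null && !d.hasBlockFile && d.hasMetaFile) return false; // orphaned meta file
        MemEntry want = expected(d);
        if (want == null || m == null) return want == null && m == null;
        return want.length == m.length && want.genStamp == m.genStamp;
    }

    /** Compare snapshots of both sides and return the block IDs that look inconsistent. */
    List<Long> findSuspects() {
        Map<Long, DiskEntry> diskSnap = new HashMap<>(disk); // the slow scan, done without the lock
        Map<Long, MemEntry> memSnap;
        synchronized (this) { memSnap = new HashMap<>(memory); }
        TreeSet<Long> ids = new TreeSet<>(diskSnap.keySet());
        ids.addAll(memSnap.keySet());
        List<Long> suspects = new ArrayList<>();
        for (long id : ids) {
            if (!consistent(memSnap.get(id), diskSnap.get(id))) suspects.add(id);
        }
        return suspects;
    }

    /** Reconcile one block under the lock, re-checking that the difference still exists. */
    synchronized String reconcileOne(long id) {
        DiskEntry d = disk.get(id);
        MemEntry m = memory.get(id);
        if (consistent(m, d)) return "unchanged"; // state changed since the scan
        MemEntry want = expected(d);
        if (want == null) {
            if (d != null) disk.remove(id); // drop the orphaned meta file, if any
            if (m == null) return "orphan-meta-deleted";
            memory.remove(id);
            return "removed-from-memory";
        }
        memory.put(id, want); // the disk is treated as the source of truth
        return m == null ? "added" : "updated";
    }
}
```

A caller would run findSuspects() periodically and then call reconcileOne(id) for each returned ID. Because each call takes the lock only briefly, the slow disk scan never holds it, and the block report can be built from the in-memory map alone.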