[ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676407#action_12676407 ]
Suresh Srinivas commented on HADOOP-4584: ----------------------------------------- Currently one of the version of the patches (Feb 10) introduces separate heartbeat thread without the other changes. I would like to get consensus on if we need to detect missing files faster than what block verification can do. Once we agree to that, we can go for long term solution, which should include: Block deleted report: Sending block deleted (much like block received) from datanode to namenode. Currently block report is the way namenode learns about the deleted blocks. With this change, we can send block reports less frequently. Faster block scanning functionality for missing/lingering files: - We could have a thread that lists the files (without holding a global lock) and reconciles the blocks on the disk with blocks maintained in the FSDataset. This could be done by deleting the blocks from FSDataset map under the following conditions: - block file or a block meta data file is missing on the disk but exists in FSDataset map - block meta data does not match block information (block size and generation stamp) in FSDataset map - block file or block meta data file exists on the disk and does not exist in FSDataset map I was thinking of doing this in a separate jira for long term. > Slow generation of blockReport at DataNode causes delay of sending heartbeat > to NameNode > ---------------------------------------------------------------------------------------- > > Key: HADOOP-4584 > URL: https://issues.apache.org/jira/browse/HADOOP-4584 > Project: Hadoop Core > Issue Type: Bug > Components: dfs > Reporter: Hairong Kuang > Assignee: Suresh Srinivas > Fix For: 0.20.0 > > Attachments: 4584.patch, 4584.patch, 4584.patch, 4584.patch, > 4584.patch, 4584.patch > > > sometimes due to disk or some other problems, datanode takes minutes or tens > of minutes to generate a block report. It causes the datanode not able to > send heartbeat to NameNode every 3 seconds. In the worst case, it makes > NameNode to detect a lost heartbeat and wrongly decide that the datanode is > dead. > It would be nice to have two threads instead. One thread is for scanning data > directories and generating block report, and executes the requests sent by > NameNode; Another thread is for sending heartbeats, block reports, and > picking up the requests from NameNode. By having these two threads, the > sending of heartbeats will not get delayed by any slow block report or slow > execution of NameNode requests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.