[ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678553#action_12678553 ]

Konstantin Shvachko commented on HADOOP-4584:
---------------------------------------------

Separating a HB thread from the main offerService thread has the following disadvantages:
# This does not remove contention on processing block reports. That is, the data-node is still blocked while preparing a block report and cannot do anything useful, like sending blockReceived or processing commands from the name-node. The only good thing is that it does not die.
# We lose automatic data-node activity throttling with this. That is, while the data-node is busy it still sends heartbeats and the name-node replies with commands, which pile up in the queue because the DN cannot process them. This can probably be solved with smart command queue maintenance or by adjusting the heartbeat frequency according to the length of the queue, but it will require more work and very thorough tuning.
# Related to the previous point: administrators will no longer be able to judge that a data-node is in trouble just by looking at its heartbeat interval.

So I would argue for keeping HB processing in the main offerService loop and instead moving block report processing into a separate thread. In general we should keep all heavy-weight operations, like delete-blocks, away from the offerService loop. They can be done in separate threads.
Does that make me a supporter of "Option 3"?

> Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.hbthread.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens of minutes to generate a block report. This prevents the datanode from sending a heartbeat to the NameNode every 3 seconds. In the worst case, the NameNode detects a lost heartbeat and wrongly decides that the datanode is dead.
> It would be nice to have two threads instead. One thread scans the data directories, generates the block report, and executes the requests sent by the NameNode; the other thread sends heartbeats and block reports and picks up the requests from the NameNode. With these two threads, the sending of heartbeats will not be delayed by a slow block report or by slow execution of NameNode requests.
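
For illustration, here is a minimal Java sketch of the arrangement argued for in the comment: heartbeats stay in the main service loop, and only the slow block report scan is handed to a separate thread. This is not the DataNode code and not any of the attached patches. The class `OfferServiceSketch`, its methods, and the strings standing in for RPCs are all hypothetical, and a latch stands in for a slow disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch only; none of these names come from the DataNode source. */
class OfferServiceSketch {
    // A single worker thread builds block reports off the main loop.
    private final ExecutorService reportBuilder = Executors.newSingleThreadExecutor();
    private Future<List<Long>> pendingReport;     // report currently being built, if any
    private boolean reportDue;                    // set when a new report should be started
    final List<String> sent = new ArrayList<>();  // stand-in for RPCs to the name-node

    void requestBlockReport() { reportDue = true; }

    /** One pass of the main loop: always heartbeat, never wait for the scan. */
    void iterate(Callable<List<Long>> scanner) {
        sent.add("heartbeat");
        if (pendingReport != null && pendingReport.isDone()) {
            // The scan has finished, so get() returns immediately here.
            sent.add("blockReport:" + awaitReport().size());
            pendingReport = null;
        } else if (pendingReport == null && reportDue) {
            pendingReport = reportBuilder.submit(scanner);  // slow scan runs elsewhere
            reportDue = false;
        }
    }

    /** Blocks until the pending report is built; used by the demo to stay deterministic. */
    List<Long> awaitReport() {
        try {
            return pendingReport.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() { reportBuilder.shutdownNow(); }

    public static void main(String[] args) {
        OfferServiceSketch dn = new OfferServiceSketch();
        CountDownLatch diskIsSlow = new CountDownLatch(1);
        // The scan blocks until the latch opens, imitating a slow disk.
        Callable<List<Long>> scanner = () -> { diskIsSlow.await(); return List.of(1L, 2L); };

        dn.requestBlockReport();
        for (int i = 0; i < 3; i++) dn.iterate(scanner);  // heartbeats continue during the scan
        diskIsSlow.countDown();                           // the scan can now finish
        dn.awaitReport();
        dn.iterate(scanner);                              // heartbeat, then the finished report
        dn.shutdown();
        System.out.println(dn.sent);
        // prints [heartbeat, heartbeat, heartbeat, heartbeat, blockReport:2]
    }
}
```

Because the heartbeat and the command handling would still share one loop in this sketch, a data-node that falls behind on commands would also heartbeat late, which is the throttling behaviour the second point above wants to keep.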