[ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Srinivas updated HADOOP-4584:
------------------------------------

    Attachment: 4584.patch

The current loop in {{DataNode.offerService()}} performs the following steps:
1. If the next heartbeat is due, call {{sendHeartbeat}} and process the {{DatanodeCommand}} returned by the namenode.
2. If a block has been received, send a {{blockReceived}} request to the namenode.
3. If the next block report is due, build and send a {{blockReport}}, then process the {{DatanodeCommand}} returned by the namenode.
4. Wait until the next heartbeat interval or until another block is received.
5. Go back to 1.

With the changes we have two threads.

Heartbeat thread:
1. The new thread sends the heartbeat, receives a {{DatanodeCommand}} in response, and queues the command in an {{ArrayList}}.

The main thread does the following, without the previous heartbeat functionality:
1. If there are commands in the queue, process all of them.
2. If a block has been received, send a {{blockReceived}} request to the namenode.
3. If the next block report is due, build and send a {{blockReport}}, then process the {{DatanodeCommand}} returned by the namenode.
4. If there are no received blocks or commands to process, wait for 1 second or until another block is received.
5. Go back to 1.

Questions:
1. In step 4, should we wait for a command to arrive or for another block to be received?
2. In {{offerService()}} we process all the commands in the queue at once. Do you see any issues with that?

> Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.patch
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens
> of minutes to generate a block report.
> This prevents the datanode from sending a heartbeat to the NameNode every
> 3 seconds. In the worst case, the NameNode detects a lost heartbeat and
> wrongly decides that the datanode is dead.
> It would be nice to have two threads instead. One thread scans the data
> directories, generates the block report, and executes the requests sent by
> the NameNode; the other thread sends heartbeats and block reports and picks
> up the requests from the NameNode. With these two threads, the sending of
> heartbeats will not be delayed by a slow block report or slow execution of
> NameNode requests.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
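For discussion, here is a minimal Java sketch of the two-thread hand-off described in the comment: a heartbeat thread queues commands, and the main loop drains the queue, handles received blocks, and then waits. It is not the code in 4584.patch. The class {{OfferServiceSketch}} and the methods {{enqueue}}, {{blockReceived}} and {{runOnce}} are hypothetical names, a plain {{String}} stands in for {{DatanodeCommand}}, and the RPCs and the block report step are left out. The wait in this sketch wakes on either a new command or a new block, which is only one possible answer to question 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not the classes touched by the patch. */
final class OfferServiceSketch {
    private final Object lock = new Object();
    // Commands queued by the heartbeat thread; guarded by lock.
    private final List<String> pending = new ArrayList<>();
    // Blocks received since the last loop iteration; guarded by lock.
    private int receivedBlocks = 0;

    // The two fields below are touched only by the main (offerService) thread.
    final List<String> processed = new ArrayList<>();
    int blocksReported = 0;

    /** Heartbeat thread: called with each command returned by a heartbeat. */
    void enqueue(String command) {
        synchronized (lock) {
            pending.add(command);
            lock.notifyAll(); // wake the main loop if it is idle
        }
    }

    /** Data transfer threads: called when a block has been received. */
    void blockReceived() {
        synchronized (lock) {
            receivedBlocks++;
            lock.notifyAll();
        }
    }

    /**
     * One iteration of the main loop. timeoutMs must be positive.
     * Returns the number of commands processed in this iteration.
     */
    int runOnce(long timeoutMs) {
        List<String> batch;
        int blocks;
        synchronized (lock) {
            // Take everything queued so far; processing happens outside the lock
            // so the heartbeat thread is never blocked by slow commands.
            batch = new ArrayList<>(pending);
            pending.clear();
            blocks = receivedBlocks;
            receivedBlocks = 0;
        }
        // Step 1: process all queued commands at once.
        processed.addAll(batch);
        // Step 2: placeholder for the blockReceived request to the namenode.
        blocksReported += blocks;
        // Step 3 (build and send the block report when due) is omitted here.
        // Step 4: if nothing new arrived meanwhile, wait for a command or a block.
        synchronized (lock) {
            if (pending.isEmpty() && receivedBlocks == 0) {
                try {
                    lock.wait(timeoutMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt
                }
            }
        }
        return batch.size();
    }
}
```

In this arrangement the heartbeat thread would call {{enqueue}} after each {{sendHeartbeat}} response, and the main thread would call {{runOnce(1000)}} in a loop to get the 1 second wait from step 4. Copying the queue under the lock and processing outside it means a slow command delays only the main thread, not the heartbeats.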