Name-node should demand a block report from resurrected data-nodes.
-------------------------------------------------------------------

                 Key: HADOOP-641
                 URL: http://issues.apache.org/jira/browse/HADOOP-641
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.7.2, 0.1.0
            Reporter: Konstantin Shvachko


1. This bug contributed to the crash discussed in HADOOP-572.
The problem is that when the name-node is busy and cannot process all requests from its clients, it can consider a data-node dead and discard its blocks, adding them to the neededReplications list.
When it finally receives a heartbeat from this data-node, it resurrects the node but not the node's blocks, and hence continues to replicate them.
The name-node will eventually receive a block report from this data-node, but that can take up to 1 hour.
During this time it proceeds with unnecessary block replications, which could be avoided if the data-node sent its block report right after the resurrection.

I modified the code so that the name-node requests a block report if it receives a heartbeat from a dead data-node.
I introduced a new command type in the BlockCommand class.
I replaced the multiple boolean indicators of the command types with one enum field.
I changed the DatanodeProtocol version.

2. This patch also includes a fix for data-node registration.
If a data-node times out during registration, it silently exits, which is hard to notice with a large number of nodes.
This patch places registration in a loop so that it can retry.
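
The two changes described above can be illustrated with a minimal Java sketch. This is not the patch itself: NameNodeSketch, Action, RegistrationRetry, Registration, and the expiry and sleep parameters are hypothetical names chosen for illustration, and the real BlockCommand and DatanodeProtocol classes are not reproduced here. The sketch assumes that a heartbeat from a node considered dead should be answered with a block-report request, and that a registration timeout should be retried rather than ending the process.

```java
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the name-node side of heartbeat handling. */
class NameNodeSketch {
    /** One enum field in place of several boolean command-type flags. */
    enum Action { NONE, TRANSFER, INVALIDATE, BLOCK_REPORT }

    /** Assumed interval after which a silent data-node is considered dead. */
    private final long expiryMillis;
    /** Time of the last heartbeat seen from each data-node. */
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    NameNodeSketch(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    void register(String nodeId, long now) {
        lastHeartbeat.put(nodeId, now);
    }

    /** In this sketch an unknown node is also treated as dead. */
    boolean isDead(String nodeId, long now) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || now - last > expiryMillis;
    }

    /** A heartbeat from a dead node resurrects it and asks for a block report. */
    Action handleHeartbeat(String nodeId, long now) {
        boolean wasDead = isDead(nodeId, now);
        lastHeartbeat.put(nodeId, now);
        return wasDead ? Action.BLOCK_REPORT : Action.NONE;
    }
}

/** Hypothetical stand-in for the data-node side of registration. */
class RegistrationRetry {
    /** A single registration attempt that may time out. */
    interface Registration<T> {
        T attempt() throws SocketTimeoutException;
    }

    /** Repeats the attempt until it succeeds instead of exiting silently. */
    static <T> T registerWithRetry(Registration<T> registration, long sleepMillis) {
        while (true) {
            try {
                return registration.attempt();
            } catch (SocketTimeoutException e) {
                // Log the failure so that it is visible on a large cluster.
                System.err.println("Registration timed out, retrying: " + e.getMessage());
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted during registration", ie);
                }
            }
        }
    }
}
```

In this sketch, a data-node that receives Action.BLOCK_REPORT in reply to its heartbeat would send its block report immediately rather than waiting for the next scheduled one.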