[ https://issues.apache.org/jira/browse/HDFS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000437#comment-13000437 ]
Matt Foley commented on HDFS-1667:
----------------------------------

Hi Brian,
I agree that's a real risk. The hybrid solution I had in mind was:
(a) Send a full block report at startup.
(b) Send a checksum of a full report with every differential report -- the
    namenode can then scan its own blocklist for that datanode, and see if
    the checksum is what's expected. If not, it can demand a full block
    report from the datanode.
(c) Send a full block report periodically, even if not asked (as you suggest).

With the checksum in every diff report, I was thinking it would be adequate
to send the full report daily, but of course that can be parameterized.

> Consider redesign of block report processing
> --------------------------------------------
>
>                 Key: HDFS-1667
>                 URL: https://issues.apache.org/jira/browse/HDFS-1667
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>
> The current implementation of block reporting has the datanode send the
> entire list of blocks resident on that node in every report, hourly.
> BlockManager.processReport in the namenode runs each block report through
> reportDiff() first, to build four linked lists of replicas for different
> dispositions, then processes each list. During that process, every block
> belonging to the node (even the unchanged blocks) is removed and re-linked
> in that node's blocklist. The entire process happens under the global
> FSNamesystem write lock, so there is essentially zero concurrency. It takes
> about 90 milliseconds to process a single 50,000-replica block report in the
> namenode, during which no other read or write activities can occur.
> There are several opportunities for improvement in this design. In order of
> probable benefit, they include:
> 1. Change the datanode to send a differential report. This saves the
> namenode from having to do most of the work in reportDiff(), and avoids the
> need to re-link all the unchanged blocks during the "diff" calculation.
> 2. Keep the linked lists of "to do" work, but modify reportDiff() so that it
> can be done under a read lock instead of the write lock. Then only the
> processing of the lists needs to be under the write lock. Since the number
> of blocks changed is usually much smaller than the number unchanged, this
> should improve concurrency.
> 3. Eliminate the linked lists and just immediately process each changed block
> as it is read from the block report. The work on HDFS-1295 indicates that
> this might save a large fraction of the total block processing time at scale,
> due to the much smaller number of objects created and garbage collected
> during processing of hundreds of millions of blocks.
> 4. As a sidelight to #3, remove linked list use from BlockManager.addBlock().
> It currently uses linked lists as an argument to processReportedBlock() even
> though addBlock() only processes a single replica on each call. This should
> be replaced with a call to immediately process the block instead of enqueuing
> it.
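
To make the hybrid scheme in the comment above concrete, here is a rough sketch of step (b): the datanode sends only the added and removed block IDs plus a checksum of its full block set, and the namenode applies the diff to its own view and compares checksums. Everything here is an assumption for illustration: the class BlockReportSketch, the DiffReport type, and the XOR-based checksum are hypothetical and are not HDFS APIs. A real report would also have to cover generation stamps and block lengths, not just IDs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch only; none of these names exist in HDFS. */
final class BlockReportSketch {

    /** Differential report: changes since the last report, plus a checksum of the full block set. */
    static final class DiffReport {
        final Set<Long> added;
        final Set<Long> removed;
        final long fullChecksum;

        DiffReport(Set<Long> added, Set<Long> removed, long fullChecksum) {
            this.added = added;
            this.removed = removed;
            this.fullChecksum = fullChecksum;
        }
    }

    /** Scrambles one block ID so that simple ID patterns do not cancel out under XOR. */
    static long mix(long x) {
        x ^= x >>> 30;
        x *= 0xbf58476d1ce4e5b9L;
        x ^= x >>> 27;
        x *= 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    /** Order-independent checksum of a block set (assumed scheme: XOR of mixed IDs). */
    static long checksum(Set<Long> blockIds) {
        long sum = 0L;
        for (long id : blockIds) {
            sum ^= mix(id);
        }
        return sum;
    }

    /** Datanode side: compare the previously reported set with the current one. */
    static DiffReport buildDiff(Set<Long> previous, Set<Long> current) {
        Set<Long> added = new HashSet<>(current);
        added.removeAll(previous);
        Set<Long> removed = new HashSet<>(previous);
        removed.removeAll(current);
        return new DiffReport(added, removed, checksum(current));
    }

    /**
     * Namenode side: apply the diff to the namenode's view of this datanode,
     * then scan that view and compare checksums.
     * Returns true if the views agree, false if a full block report should be demanded.
     */
    static boolean applyDiff(Set<Long> namenodeView, DiffReport report) {
        namenodeView.addAll(report.added);
        namenodeView.removeAll(report.removed);
        return checksum(namenodeView) == report.fullChecksum;
    }
}
```

In this sketch a false return from applyDiff() corresponds to the "demand a full block report" case in (b). The XOR form was picked only because it does not depend on iteration order and could be updated incrementally as blocks come and go; a stronger digest may be preferable if accidental collisions are a concern.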