[ https://issues.apache.org/jira/browse/HDFS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003105#comment-13003105 ]
Matt Foley commented on HDFS-1667: ---------------------------------- Hi Dhruba, for current scale, yes it would help a lot. That's #2 on the list above. However, even if we improve concurrency, it would also be worthwhile to cut down the quantity of data processed per block report. Right now, we only have about 50K replicas on a typical datanode, but with bigger and cheaper drives, by the end of the year it will be practical to build datanodes with 4-6 times the storage. Since most of those ~300,000 replicas won't change on an hourly basis, it is wasteful to send (and reprocess) them all every hour. > Consider redesign of block report processing > -------------------------------------------- > > Key: HDFS-1667 > URL: https://issues.apache.org/jira/browse/HDFS-1667 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node > Affects Versions: 0.22.0 > Reporter: Matt Foley > Assignee: Matt Foley > > The current implementation of block reporting has the datanode send the > entire list of blocks resident on that node in every report, hourly. > BlockManager.processReport in the namenode runs each block report through > reportDiff() first, to build four linked lists of replicas for different > dispositions, then processes each list. During that process, every block > belonging to the node (even the unchanged blocks) are removed and re-linked > in that node's blocklist. The entire process happens under the global > FSNamesystem write lock, so there is essentially zero concurrency. It takes > about 90 milliseconds to process a single 50,000-replica block report in the > namenode, during which no other read or write activities can occur. > There are several opportunities for improvement in this design. In order of > probable benefit, they include: > 1. Change the datanode to send a differential report. This saves the > namenode from having to do most of the work in reportDiff(), and avoids the > need to re-link all the unchanged blocks during the "diff" calculation. > 2. Keep the linked lists of "to do" work, but modify reportDiff() so that it > can be done under a read lock instead of the write lock. Then only the > processing of the lists needs to be under the write lock. Since the number > of blocks changed is usually much smaller than the number unchanged, this > should improve concurrency. > 3. Eliminate the linked lists and just immediately process each changed block > as it is read from the block report. The work on HDFS-1295 indicates that > this might save a large fraction of the total block processing time at scale, > due to the much smaller number of objects created and garbage collected > during processing of hundreds of millions of blocks. > 4. As a sidelight to #3, remove linked list use from BlockManager.addBlock(). > It currently uses linked lists as an argument to processReportedBlock() even > though addBlock() only processes a single replica on each call. This should > be replaced with a call to immediately process the block instead of enqueuing > it. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira