[jira] Commented: (HDFS-1667) Consider redesign of block report processing

Matt Foley (JIRA) Mon, 28 Feb 2011 10:03:01 -0800

    [ https://issues.apache.org/jira/browse/HDFS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000437#comment-13000437 ]


Matt Foley commented on HDFS-1667:
----------------------------------

Hi Brian, I agree that's a real risk.  The hybrid solution I had in mind was:
(a) Send a full block report at startup.
(b) Send a checksum of the full block list with every differential report -- the 
namenode can then compute the same checksum over its own blocklist for that 
datanode and compare the two.  If they differ, it can demand a full block 
report from the datanode.
(c) Send a full block report periodically, even if not asked (as you suggest).  
With the checksum in every diff report, I was thinking it would be adequate to 
send the full report daily, but of course that can be parameterized.
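
To make (b) concrete, here is a minimal sketch of the checksum comparison. It is 
only an illustration under stated assumptions, not code from HDFS: the Replica 
class, the choice of fields, and the method names are all hypothetical.  The 
checksum is a sum of per-replica hashes, so it does not depend on the order in 
which either side walks its block list.

```java
final class BlockReportChecksum {
    /** Hypothetical replica descriptor for this sketch; not the HDFS Block class. */
    static final class Replica {
        final long blockId;
        final long generationStamp;
        final long numBytes;

        Replica(long blockId, long generationStamp, long numBytes) {
            this.blockId = blockId;
            this.generationStamp = generationStamp;
            this.numBytes = numBytes;
        }
    }

    /** 64-bit bit mixer (SplitMix64-style finalizer) used to scramble each field. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /**
     * Order-independent checksum: hash each replica, then add the hashes.
     * Addition is commutative, so datanode and namenode may iterate in any order.
     */
    static long checksum(Iterable<Replica> replicas) {
        long sum = 0;
        for (Replica r : replicas) {
            long h = mix(r.blockId);
            h = mix(h ^ r.generationStamp);
            h = mix(h ^ r.numBytes);
            sum += h; // wraps modulo 2^64, which is fine for a checksum
        }
        return sum;
    }

    /**
     * Namenode side (assumed flow): recompute over its own view of the datanode
     * and ask for a full report when the values disagree.
     */
    static boolean needsFullReport(Iterable<Replica> namenodeView, long reportedChecksum) {
        return checksum(namenodeView) != reportedChecksum;
    }
}
```

A mismatch only says that the two views differ, not where, which is why the 
fallback in this sketch is a full report rather than an attempt at repair.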

> Consider redesign of block report processing
> --------------------------------------------
>
>                 Key: HDFS-1667
>                 URL: https://issues.apache.org/jira/browse/HDFS-1667
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>
> The current implementation of block reporting has the datanode send the 
> entire list of blocks resident on that node in every report, hourly.  
> BlockManager.processReport in the namenode runs each block report through 
> reportDiff() first, to build four linked lists of replicas for different 
> dispositions, then processes each list.  During that process, every block 
> belonging to the node (including unchanged blocks) is removed and re-linked 
> in that node's blocklist.  The entire process happens under the global 
> FSNamesystem write lock, so there is essentially zero concurrency.  It takes 
> about 90 milliseconds to process a single 50,000-replica block report in the 
> namenode, during which no other read or write activities can occur.
> There are several opportunities for improvement in this design.  In order of 
> probable benefit, they include:
> 1. Change the datanode to send a differential report.  This saves the 
> namenode from having to do most of the work in reportDiff(), and avoids the 
> need to re-link all the unchanged blocks during the "diff" calculation.
> 2. Keep the linked lists of "to do" work, but modify reportDiff() so that it 
> can be done under a read lock instead of the write lock.  Then only the 
> processing of the lists needs to be under the write lock.  Since the number 
> of blocks changed is usually much smaller than the number unchanged, this 
> should improve concurrency.
> 3. Eliminate the linked lists and just immediately process each changed block 
> as it is read from the block report.  The work on HDFS-1295 indicates that 
> this might save a large fraction of the total block processing time at scale, 
> due to the much smaller number of objects created and garbage collected 
> during processing of hundreds of millions of blocks.
> 4. As a sidelight to #3, remove linked list use from BlockManager.addBlock().  
> It currently uses linked lists as an argument to processReportedBlock() even 
> though addBlock() only processes a single replica on each call.  This should 
> be replaced with a call to immediately process the block instead of enqueuing 
> it.
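
Item 2 of the description above can be sketched as a two-phase pass: compute the 
diff under the read lock, then take the write lock only to apply the (usually 
small) list of changes.  This is a toy model under stated assumptions, not the 
BlockManager code: the class, its fields, and the use of bare block IDs are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class TwoPhaseReportProcessor {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Block IDs the namenode currently believes this datanode holds (toy model).
    private final Set<Long> stored = new HashSet<>();

    /** The "to do" work produced by the diff phase. */
    static final class Diff {
        final List<Long> toAdd = new ArrayList<>();
        final List<Long> toRemove = new ArrayList<>();
    }

    /** Phase 1: compare the report with stored state under the shared read lock. */
    Diff computeDiff(Set<Long> reported) {
        Diff diff = new Diff();
        lock.readLock().lock();
        try {
            for (Long id : reported) {
                if (!stored.contains(id)) diff.toAdd.add(id);
            }
            for (Long id : stored) {
                if (!reported.contains(id)) diff.toRemove.add(id);
            }
        } finally {
            lock.readLock().unlock();
        }
        return diff;
    }

    /**
     * Phase 2: apply only the changes under the exclusive write lock.
     * State may have changed between the phases; these set operations are
     * idempotent, but a real implementation would need to revalidate each entry.
     */
    void applyDiff(Diff diff) {
        lock.writeLock().lock();
        try {
            stored.addAll(diff.toAdd);
            stored.removeAll(diff.toRemove);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void processReport(Set<Long> reported) {
        applyDiff(computeDiff(reported));
    }

    /** Copy of the current state, taken under the read lock. */
    Set<Long> snapshot() {
        lock.readLock().lock();
        try {
            return new HashSet<>(stored);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The read lock is released before the write lock is taken because 
ReentrantReadWriteLock does not support upgrading a read lock to a write lock.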

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
