[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339693#comment-14339693 ]

Colin Patrick McCabe commented on HDFS-7836:
--------------------------------------------

bq. These two sound contradictory. I assume the former is correct and we won't 
really take the FSN lock. Also I did not get how you will process one stripe at 
a time without repeatedly locking and unlocking, since DataNodes wouldn't know 
about the block to stripe mapping to order the reports. I guess I will wait to 
see the code. 

Let me clarify a bit.  It may be necessary to take the FSN lock very briefly 
once at the start or end of the block report, but in general, we do not want to 
do the block report under the FSN lock as it is done currently.  The main 
processing of blocks should not happen under the FSN lock.
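
To make that concrete, here is a minimal sketch of the lock discipline I have 
in mind (all names here are made up for illustration; the real FSN lock lives 
in FSNamesystem):

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of the intended lock discipline.  Every name here is
// illustrative, not taken from FSNamesystem.
class BlockReportLockDiscipline {
  private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

  void processBlockReport(long[] reportedBlockIds) {
    fsnLock.writeLock().lock();
    try {
      beginBlockReport();  // brief bookkeeping under the FSN lock
    } finally {
      fsnLock.writeLock().unlock();
    }

    processBlocks(reportedBlockIds);  // the bulk of the work, no FSN lock held

    fsnLock.writeLock().lock();
    try {
      finishBlockReport();  // brief bookkeeping under the FSN lock
    } finally {
      fsnLock.writeLock().unlock();
    }
  }

  private void beginBlockReport() { /* e.g. mark the report in progress */ }
  private void processBlocks(long[] ids) { /* per-stripe processing */ }
  private void finishBlockReport() { /* e.g. record report completion */ }
}
{code}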

The idea is to separate the BlocksMap into multiple stripes.  Which blocks go 
into which stripes is determined by blockID.  Something like "blockID mod 5" 
would work for this.  Clearly, the incoming block report will contain work for 
several stripes.  As you mentioned, that necessitates locking and unlocking.  
But we can process multiple blocks each time we grab a stripe lock.  We simply 
accumulate the blocks that we know belong to each stripe in a per-stripe 
buffer (maybe 1000 blocks at a time between lock releases... maybe a little 
more).  This is a knob we can tune to trade off latency against throughput.  
It also means that we will have to handle operations on blocks, new blocks 
being created, etc. during a block report.
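
To make the batching concrete, here is a minimal sketch (the stripe count, 
batch size, and all class/method names are just illustrations for this 
comment, not code from the attached design doc):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: stripe count, batch size, and all names here are
// assumptions for this comment, not a real patch.
class StripedReportProcessor {
  private static final int NUM_STRIPES = 5;    // "blockID mod 5"
  private static final int BATCH_SIZE = 1000;  // blocks handled per lock grab

  private final Object[] stripeLocks = new Object[NUM_STRIPES];
  private final List<List<Long>> stripeBuffers = new ArrayList<>();

  StripedReportProcessor() {
    for (int i = 0; i < NUM_STRIPES; i++) {
      stripeLocks[i] = new Object();
      stripeBuffers.add(new ArrayList<Long>());
    }
  }

  /** Stripe assignment is a pure function of the block ID, so DataNodes
      don't need to know anything about the striping. */
  private static int stripeOf(long blockId) {
    return (int) Long.remainderUnsigned(blockId, NUM_STRIPES);
  }

  /** Buffer reported block IDs per stripe and flush a stripe when its buffer
      fills, so each stripe lock is taken once per BATCH_SIZE blocks rather
      than once per block. */
  void processReport(long[] reportedBlockIds) {
    for (long id : reportedBlockIds) {
      int stripe = stripeOf(id);
      List<Long> buf = stripeBuffers.get(stripe);
      buf.add(id);
      if (buf.size() >= BATCH_SIZE) {
        flushStripe(stripe);
      }
    }
    for (int i = 0; i < NUM_STRIPES; i++) {
      flushStripe(i);  // drain any partially filled buffers
    }
  }

  private void flushStripe(int stripe) {
    List<Long> buf = stripeBuffers.get(stripe);
    if (buf.isEmpty()) {
      return;
    }
    synchronized (stripeLocks[stripe]) {  // one lock acquisition per batch
      for (long id : buf) {
        updateBlocksMapEntry(id);  // per-block work under the stripe lock
      }
    }
    buf.clear();
  }

  private void updateBlocksMapEntry(long blockId) {
    // real code would update replica state in this stripe of the BlocksMap
  }
}
{code}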

bq. IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with 
a hypothetical 10TB disk and a low average block size of 10MB, we get 1M 
blocks/disk in the foreseeable future. With splitting that gets you to a 
reasonable ~7MB block report per disk. I am not saying no to 
chunking/compression but it would be useful to see some perf comparison before 
we add that complexity.

1M blocks per disk, on 10 disks, at 24 bytes per block, is a 240 MB block 
report (1,000,000 x 10 x 24 bytes = 240 MB).  That's definitely bigger than 
we'd like the 
full BR RPC to be, and compression can help here.  Or possibly separating the 
block report into multiple RPCs.  Perhaps one RPC per storage?
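
As a rough sketch of the per-storage idea (the RPC name and interface below 
are invented for illustration, not the real DatanodeProtocol):

{code:java}
import java.util.Map;

// Hypothetical sketch of "one RPC per storage".  The interface and all names
// below are invented; this is not the existing DatanodeProtocol.blockReport.
class PerStorageReporter {
  interface NameNodeReportProtocol {
    void reportStorage(String datanodeUuid, String storageId, long[] blockIds);
  }

  private final NameNodeReportProtocol namenode;

  PerStorageReporter(NameNodeReportProtocol namenode) {
    this.namenode = namenode;
  }

  /** Send one RPC per storage: for a 1M-block disk at 24 bytes per block,
      each message stays around 24 MB instead of one 240 MB RPC covering the
      whole DataNode. */
  void sendReports(String datanodeUuid, Map<String, long[]> blocksByStorage) {
    for (Map.Entry<String, long[]> e : blocksByStorage.entrySet()) {
      namenode.reportStorage(datanodeUuid, e.getKey(), e.getValue());
    }
  }
}
{code}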

Incidentally, there's nothing magic about the 64 MB RPC size.  I originally 
added that number as the max RPC size when I did HADOOP-9676, just by choosing 
a large power of 2 that seemed way too big for a real, non-corrupt RPC message. 
:)  But big RPCs are bad in general because of the way Hadoop IPC works.  To 
improve our latency, we probably want a block report RPC size much smaller 
than 64 MB.

> BlockManager Scalability Improvements
> -------------------------------------
>
>                 Key: HDFS-7836
>                 URL: https://issues.apache.org/jira/browse/HDFS-7836
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Charles Lamb
>            Assignee: Charles Lamb
>         Attachments: BlockManagerScalabilityImprovementsDesign.pdf
>
>
> Improvements to BlockManager scalability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
