[
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339693#comment-14339693
]
Colin Patrick McCabe commented on HDFS-7836:
--------------------------------------------
bq. These two sound contradictory. I assume the former is correct and we won't
really take the FSN lock. Also I did not get how you will process one stripe at
a time without repeatedly locking and unlocking, since DataNodes wouldn't know
about the block to stripe mapping to order the reports. I guess I will wait to
see the code.
Let me clarify a bit. It may be necessary to take the FSN lock very briefly
once at the start or end of the block report, but in general, we do not want to
do the block report under the FSN lock as it is done currently. The main
processing of blocks should not happen under the FSN lock.
The idea is to separate the BlocksMap into multiple stripes. Which blocks go
into which stripes is determined by blockID. Something like "blockID mod 5"
would work for this. Clearly, the incoming block report will contain work for
several stripes. As you mentioned, that necessitates locking and unlocking.
But we can do multiple blocks each time we grab a stripe lock. We simply
accumulate a bunch of blocks that we know are in each stripe in a per-stripe
buffer (maybe we do 1000 blocks at a time in between lock release... maybe a
little more). This is something that we can tune to make a tradeoff between
latency and throughput. This also means that we will have to be able to handle
operations on blocks, new blocks being created, etc. during a block report.
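The striping and batching scheme described above can be sketched in a few lines of Java. This is a minimal illustration, not the actual HDFS BlockManager code: names like StripedBlockMap, NUM_STRIPES, and BATCH_SIZE are made up for the example, and updating the real BlocksMap is replaced by a counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of a striped block map: blocks are partitioned into stripes by
 * blockID ("blockID mod 5"), each stripe has its own lock, and a block
 * report is processed in batches so that locks are released between
 * batches instead of holding one global (FSN) lock for the whole report.
 */
class StripedBlockMap {
  static final int NUM_STRIPES = 5;    // "blockID mod 5" from the comment
  static final int BATCH_SIZE = 1000;  // tunable latency/throughput tradeoff

  final ReentrantLock[] stripeLocks = new ReentrantLock[NUM_STRIPES];
  final List<List<Long>> stripeBuffers = new ArrayList<>();
  long processed = 0;  // stand-in for real BlocksMap updates

  StripedBlockMap() {
    for (int i = 0; i < NUM_STRIPES; i++) {
      stripeLocks[i] = new ReentrantLock();
      stripeBuffers.add(new ArrayList<>());
    }
  }

  static int stripeOf(long blockId) {
    return (int) Math.floorMod(blockId, (long) NUM_STRIPES);
  }

  /** Accumulate reported blocks into per-stripe buffers; flush full buffers. */
  void processBlockReport(long[] reportedBlockIds) {
    for (long id : reportedBlockIds) {
      int s = stripeOf(id);
      stripeBuffers.get(s).add(id);
      if (stripeBuffers.get(s).size() >= BATCH_SIZE) {
        flushStripe(s);
      }
    }
    for (int s = 0; s < NUM_STRIPES; s++) {
      flushStripe(s);  // flush any remainders at the end of the report
    }
  }

  void flushStripe(int s) {
    List<Long> batch = stripeBuffers.get(s);
    if (batch.isEmpty()) {
      return;
    }
    stripeLocks[s].lock();  // only this stripe is locked, not the whole map
    try {
      processed += batch.size();  // here the real code would update BlocksMap
    } finally {
      stripeLocks[s].unlock();
    }
    batch.clear();
  }
}
```

Because each flush only takes one stripe's lock, operations on other stripes (new blocks, deletions) can proceed concurrently while a block report is in flight, which is exactly the property the paragraph above is after.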
bq. IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with
a hypothetical 10TB disk and a low average block size of 10MB, we get 1M
blocks/disk in the foreseeable future. With splitting that gets you to a
reasonable ~7MB block report per disk. I am not saying no to
chunking/compression but it would be useful to see some perf comparison before
we add that complexity.
1M blocks per disk, times 10 disks, times 24 bytes per block, is a 240 MB block
report (did I do that math right?). That's definitely bigger than we'd like the
full BR RPC to be, and compression can help here. Or we could possibly separate
the block report into multiple RPCs. Perhaps one RPC per storage?
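The arithmetic above checks out and is easy to verify. A quick sketch (the 1M/10/24 figures are the ones assumed in this thread, not measured values):

```java
/** Back-of-the-envelope size of a full block report. */
public class BlockReportMath {
  static long reportBytes(long blocksPerDisk, int disks, int bytesPerBlock) {
    return blocksPerDisk * disks * bytesPerBlock;
  }

  public static void main(String[] args) {
    // 1M blocks/disk * 10 disks * 24 bytes/block = 240,000,000 bytes
    long total = reportBytes(1_000_000L, 10, 24);
    System.out.println(total / 1_000_000 + " MB");  // 240 MB, well over 64 MB
  }
}
```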
Incidentally, there's nothing magic about the 64 MB RPC size. I originally
added that number as the max RPC size when I did HADOOP-9676, just by choosing
a large power of 2 that seemed way too big for a real, non-corrupt RPC message.
:) But big RPCs are bad in general because of the way Hadoop IPC works. To
improve our latency, we probably want a block report RPC size much smaller than
64 MB.
> BlockManager Scalability Improvements
> -------------------------------------
>
> Key: HDFS-7836
> URL: https://issues.apache.org/jira/browse/HDFS-7836
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Charles Lamb
> Assignee: Charles Lamb
> Attachments: BlockManagerScalabilityImprovementsDesign.pdf
>
>
> Improvements to BlockManager scalability.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)