[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339693#comment-14339693 ]
Colin Patrick McCabe commented on HDFS-7836:
--------------------------------------------

bq. These two sound contradictory. I assume the former is correct and we won't really take the FSN lock. Also I did not get how you will process one stripe at a time without repeatedly locking and unlocking, since DataNodes wouldn't know about the block to stripe mapping to order the reports. I guess I will wait to see the code.

Let me clarify a bit. It may be necessary to take the FSN lock very briefly once at the start or end of the block report, but in general, we do not want to process the block report under the FSN lock as is done currently. The main processing of blocks should not happen under the FSN lock.

The idea is to separate the BlocksMap into multiple stripes. Which blocks go into which stripes is determined by blockID; something like "blockID mod 5" would work for this. Clearly, the incoming block report will contain work for several stripes. As you mentioned, that necessitates locking and unlocking. But we can process multiple blocks each time we grab a stripe lock. We simply accumulate a bunch of blocks that we know are in each stripe in a per-stripe buffer (maybe we do 1000 blocks at a time between lock releases... maybe a little more). This is something that we can tune to trade off latency against throughput. This also means that we will have to be able to handle operations on blocks, new blocks being created, etc. during a block report.

bq. IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with a hypothetical 10TB disk and a low average block size of 10MB, we get 1M blocks/disk in the foreseeable future. With splitting that gets you to a reasonable ~7MB block report per disk. I am not saying no to chunking/compression but it would be useful to see some perf comparison before we add that complexity.

1M blocks per disk, on 10 disks, at 24 bytes per block, is a 240 MB block report (did I do that math right?)
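The striping scheme described above could look roughly like this (a minimal sketch with illustrative names; `StripedBlocksMapSketch`, the stripe count, and the batch size are all placeholders, not actual HDFS classes or values):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: blocks are partitioned into stripes by blockID, each stripe has
// its own lock, and reported blocks are buffered per stripe so each lock
// acquisition processes a batch rather than a single block.
class StripedBlocksMapSketch {
  static final int NUM_STRIPES = 5;    // "blockID mod 5"
  static final int BATCH_SIZE = 1000;  // tunable: latency vs. throughput

  final ReentrantLock[] stripeLocks = new ReentrantLock[NUM_STRIPES];
  final List<List<Long>> stripeBuffers = new ArrayList<>();

  StripedBlocksMapSketch() {
    for (int i = 0; i < NUM_STRIPES; i++) {
      stripeLocks[i] = new ReentrantLock();
      stripeBuffers.add(new ArrayList<>());
    }
  }

  // floorMod so negative block IDs still map to a valid stripe.
  int stripeOf(long blockId) {
    return (int) Math.floorMod(blockId, (long) NUM_STRIPES);
  }

  // Accumulate a reported block; flush its stripe when the batch fills.
  void report(long blockId) {
    int s = stripeOf(blockId);
    List<Long> buf = stripeBuffers.get(s);
    buf.add(blockId);
    if (buf.size() >= BATCH_SIZE) {
      flushStripe(s);
    }
  }

  // Process one stripe's batch under that stripe's lock only; the FSN
  // lock is never held during this work.
  void flushStripe(int s) {
    stripeLocks[s].lock();
    try {
      for (long blockId : stripeBuffers.get(s)) {
        // ... update in-memory block state for blockId ...
      }
      stripeBuffers.get(s).clear();
    } finally {
      stripeLocks[s].unlock();
    }
  }
}
```

Because other threads only need the one stripe lock to touch a block, new blocks can be created and individual block operations can proceed concurrently while a report is being drained, which is the concurrency the comment is describing.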
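The back-of-the-envelope arithmetic above does check out, using decimal megabytes:

```java
// 1M blocks/disk x 10 disks x 24 bytes/block
public class BlockReportSizeCheck {
  public static void main(String[] args) {
    long bytes = 1_000_000L  // blocks per disk
               * 10          // disks per DataNode
               * 24;         // bytes per reported block
    System.out.println(bytes);                     // 240000000 bytes
    System.out.println(bytes / 1_000_000 + " MB"); // 240 MB
  }
}
```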
That's definitely bigger than we'd like the full BR RPC to be, and compression can help here. Or possibly separating the block report into multiple RPCs. Perhaps one RPC per storage?

Incidentally, there's nothing magic about the 64 MB RPC size. I originally added that number as the max RPC size when I did HADOOP-9676, just by choosing a large power of 2 that seemed way too big for a real, non-corrupt RPC message. :) But big RPCs are bad in general because of the way Hadoop IPC works. To improve our latency, we probably want a block report RPC size much smaller than 64 MB.

> BlockManager Scalability Improvements
> -------------------------------------
>
>                 Key: HDFS-7836
>                 URL: https://issues.apache.org/jira/browse/HDFS-7836
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Charles Lamb
>            Assignee: Charles Lamb
>         Attachments: BlockManagerScalabilityImprovementsDesign.pdf
>
>
> Improvements to BlockManager scalability.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)