[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337977#comment-14337977 ]

Arpit Agarwal commented on HDFS-7836:
-------------------------------------

Hi Charles, thank you for posting the design doc. This is a very interesting 
proposal. A few comments:
bq. Full block reports are problematic. The more blocks each DataNode has, the 
longer it takes to process a full block report from that DataNode.
bq. Since the protobufs code encodes and decodes from Java arrays, this 
necessitates the allocation of a very large java array, which may cause garbage 
collection issues. By splitting block reports into multiple arrays, we can 
avoid these issues.
bq. It will eliminate the long pauses that we have now as a consequence of full 
block reports coming in
bq. We can compress block reports via lz4 or gzip.
The DataNode can now split block reports per storage directory (post 
HDFS-2832), controlled by {{DFS_BLOCKREPORT_SPLIT_THRESHOLD_KEY}}. Have you 
had a chance to try that out and see whether it helps? Splitting reports 
addresses all of the above points, since each per-storage report is 
correspondingly smaller to encode, decode, and process.
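If it would help to experiment, here is a minimal, hypothetical test-setup 
sketch (the class name and test wiring are mine; my reading of the setting is 
that a DataNode whose total block count exceeds the threshold sends one report 
RPC per storage rather than a single combined RPC):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ForceSplitReports {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // dfs.blockreport.split.threshold: with a threshold of 1, any
    // DataNode holding more than one block should report each storage
    // directory in its own RPC, so no single report message has to
    // encode the DataNode's entire block list as one giant array.
    conf.setLong(DFSConfigKeys.DFS_BLOCKREPORT_SPLIT_THRESHOLD_KEY, 1L);
    // ... start the DataNode (or a MiniDFSCluster) with this conf ...
  }
}
{code}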
bq. We have found that large Java heap sizes lead to very long startup times 
for the NameNode. This is because multiple full GCs often occur during startup 
... Reducing the size of the Java heap will improve the NameNode startup time.
Do you have any estimates for startup time overhead due to GCs?
bq. The Java heap size will be reduced, because the BlockManager data will be 
off-heap. We have found that BlockManager data makes up about 50% of a typical 
NameNode heap.
Thanks for mentioning this in the doc, it's a significant point.
bq. We will shard the BlocksMap into several “lock stripes.” The number of lock 
stripes should be matched to the number of CPUs available and the degree of 
concurrency in the workload. 
How does this affect block report processing? We cannot assume DataNodes will 
sort blocks by target stripe. Will the NameNode sort received reports by 
stripe, or will it acquire and release a lock per block? If the former, there 
should probably be some randomization of stripe order across threads to avoid 
unintended serialization, e.g. lock convoys.
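To make the concern concrete, here is a rough, hypothetical sketch (class and 
method names are mine, not from the doc) of a report-processing path that 
buckets blocks by stripe and then walks the stripes in randomized order, so 
concurrent threads don't all start at stripe 0 and convoy:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class StripedReportProcessor {
  private final int numStripes;
  private final ReentrantLock[] stripeLocks;

  public StripedReportProcessor(int numStripes) {
    this.numStripes = numStripes;
    this.stripeLocks = new ReentrantLock[numStripes];
    for (int i = 0; i < numStripes; i++) {
      stripeLocks[i] = new ReentrantLock();
    }
  }

  /** Bucket the (unsorted) report by stripe, then take each stripe
   *  lock once per report instead of once per block. */
  public void processReport(long[] blockIds) {
    List<List<Long>> buckets = new ArrayList<>(numStripes);
    for (int i = 0; i < numStripes; i++) {
      buckets.add(new ArrayList<Long>());
    }
    for (long id : blockIds) {
      buckets.get(stripeOf(id)).add(id);
    }
    // Randomize the stripe visit order so threads processing
    // different reports don't serialize behind one another.
    List<Integer> order = new ArrayList<>(numStripes);
    for (int i = 0; i < numStripes; i++) {
      order.add(i);
    }
    Collections.shuffle(order);
    for (int stripe : order) {
      if (buckets.get(stripe).isEmpty()) {
        continue;
      }
      stripeLocks[stripe].lock();
      try {
        for (long id : buckets.get(stripe)) {
          applyToBlocksMap(stripe, id);
        }
      } finally {
        stripeLocks[stripe].unlock();
      }
    }
  }

  private int stripeOf(long blockId) {
    return (int) Math.floorMod(blockId, (long) numStripes);
  }

  private void applyToBlocksMap(int stripe, long blockId) {
    // Stand-in for the actual BlocksMap mutation.
  }
}
{code}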
bq. Block IDs would be sharded into stripes based on a simple algorithm 
--possibly modulus.
The scheme should probably not be that simple. Block IDs are allocated 
sequentially, so it is not hard to imagine pathological application behavior 
skewing the block distribution across stripes over time.
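One hedged alternative, assuming nothing beyond a 64-bit block ID: run the ID 
through a cheap bit mixer before taking the modulus, so consecutively 
allocated IDs land on different stripes. The sketch below uses the SplitMix64 
finalizer as the mixer:
{code:java}
public final class StripeHash {
  private StripeHash() {}

  /** SplitMix64 finalizer: a cheap, well-distributed 64-bit mixer. */
  static long mix64(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
  }

  /** With the mixer in front, sequentially allocated block IDs spread
   *  across stripes instead of marching through them in order. */
  static int stripeOf(long blockId, int numStripes) {
    return (int) Math.floorMod(mix64(blockId), (long) numStripes);
  }
}
{code}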
bq. We will use the “flyweight pattern” to permit access to the off-heap memory 
by wrapping it with a short-lived Java object containing appropriate accessors 
and mutators.
It will be interesting to measure the impact on GC and CPU, if any.
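For what it's worth, here is a minimal sketch of the pattern as I understand 
it; the record layout and all names are invented for illustration. A single 
direct ByteBuffer holds fixed-width block records off-heap, and a small 
reusable view object is pointed at a slot before each access:
{code:java}
import java.nio.ByteBuffer;

/** Flyweight over off-heap block records. Hypothetical layout:
 *  8-byte blockId, 8-byte genStamp, 8-byte numBytes (24 bytes/record). */
public class OffHeapBlockSlab {
  private static final int RECORD_SIZE = 24;
  private final ByteBuffer slab;

  public OffHeapBlockSlab(int maxRecords) {
    // Direct buffer: the record data lives outside the GC-managed heap.
    this.slab = ByteBuffer.allocateDirect(maxRecords * RECORD_SIZE);
  }

  /** Short-lived (or reusable) view; only this tiny wrapper object is
   *  ever allocated on the Java heap. */
  public final class BlockView {
    private int base;

    public BlockView at(int slot) {
      this.base = slot * RECORD_SIZE;
      return this;
    }

    public long getBlockId()        { return slab.getLong(base); }
    public void setBlockId(long v)  { slab.putLong(base, v); }
    public long getGenStamp()       { return slab.getLong(base + 8); }
    public void setGenStamp(long v) { slab.putLong(base + 8, v); }
    public long getNumBytes()       { return slab.getLong(base + 16); }
    public void setNumBytes(long v) { slab.putLong(base + 16, v); }
  }

  public BlockView newView() {
    return new BlockView();
  }
}
{code}
If each worker thread reuses one view, the extra allocation is close to zero; 
even freshly allocated views should die in the young generation, which is 
exactly the GC behavior that would be worth measuring rather than assuming.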

> BlockManager Scalability Improvements
> -------------------------------------
>
>                 Key: HDFS-7836
>                 URL: https://issues.apache.org/jira/browse/HDFS-7836
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Charles Lamb
>            Assignee: Charles Lamb
>         Attachments: BlockManagerScalabilityImprovementsDesign.pdf
>
>
> Improvements to BlockManager scalability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
