[
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337977#comment-14337977
]
Arpit Agarwal commented on HDFS-7836:
-------------------------------------
Hi Charles, thank you for posting the design doc. This is a very interesting
proposal. A few comments:
bq. Full block reports are problematic. The more blocks each DataNode has, the
longer it takes to process a full block report from that DataNode.
bq. Since the protobufs code encodes and decodes from Java arrays, this
necessitates the allocation of a very large Java array, which may cause garbage
collection issues. By splitting block reports into multiple arrays, we can
avoid these issues.
bq. It will eliminate the long pauses that we have now as a consequence of full
block reports coming in
bq. We can compress block reports via lz4 or gzip.
The DataNode can now split block reports per storage directory (post
HDFS-2832), controlled by {{DFS_BLOCKREPORT_SPLIT_THRESHOLD_KEY}}. Did you get
a chance to try it out and see if it helps? Splitting reports addresses all of
the above.
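For reference, the threshold is configurable in hdfs-site.xml. Assuming the
constant maps to the usual key name (and the default shown is from memory, so
please double-check against hdfs-default.xml), it would look like:

```xml
<property>
  <name>dfs.blockreport.split.threshold</name>
  <!-- A DataNode with more blocks than this sends one report per storage
       directory instead of a single combined report. -->
  <value>1000000</value>
</property>
```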
bq. We have found that large Java heap sizes lead to very long startup times
for the NameNode. This is because multiple full GCs often occur during startup
... Reducing the size of the Java heap will improve the NameNode startup time.
Do you have any estimates for startup time overhead due to GCs?
bq. The Java heap size will be reduced, because the BlockManager data will be
offheap. We have found that BlockManager data makes up about 50% of a typical
NameNode heap.
Thanks for mentioning this in the doc, it's a significant point.
bq. We will shard the BlocksMap into several “lock stripes.” The number of lock
stripes should be matched to the number of CPUs available and the degree of
concurrency in the workload.
How does this affect block report processing? We cannot assume DataNodes will
sort blocks by target stripe. Will the NameNode sort received reports, or will
it acquire and release a lock per block? If the former, then there should
probably be some randomization of the stripe visit order across threads to
avoid unintended serialization, e.g. lock convoys.
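To illustrate what I mean, here is a rough sketch (class and method names are
mine, not HDFS APIs): bucket a report's blocks by stripe so each lock is taken
once per report, then shuffle the stripe visit order so concurrent
report-processing threads don't all contend on the same stripe first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class StripedReportSketch {
    static final int NUM_STRIPES = 8; // power of two, illustrative
    static final ReentrantLock[] LOCKS = new ReentrantLock[NUM_STRIPES];
    static {
        for (int i = 0; i < NUM_STRIPES; i++) {
            LOCKS[i] = new ReentrantLock();
        }
    }

    // Placeholder stripe function; see the skew discussion below.
    static int stripeOf(long blockId) {
        return (int) (Long.hashCode(blockId) & (NUM_STRIPES - 1));
    }

    static void processReport(long[] blockIds) {
        // 1. Group the report's blocks by stripe so each lock is
        //    acquired once per report, not once per block.
        Map<Integer, List<Long>> byStripe = new HashMap<>();
        for (long id : blockIds) {
            byStripe.computeIfAbsent(stripeOf(id), k -> new ArrayList<>()).add(id);
        }
        // 2. Shuffle the stripe visit order so threads processing
        //    different reports don't serialize on the same lock sequence.
        List<Integer> order = new ArrayList<>(byStripe.keySet());
        Collections.shuffle(order);
        for (int stripe : order) {
            LOCKS[stripe].lock();
            try {
                for (long id : byStripe.get(stripe)) {
                    // ... update the BlocksMap entry for id ...
                }
            } finally {
                LOCKS[stripe].unlock();
            }
        }
    }
}
```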
bq. Block IDs would be sharded into stripes based on a simple algorithm
possibly modulus.
The scheme should probably not be that simple. Block IDs are sequentially
allocated, so it is not hard to imagine pathological application behavior
causing a skewed block distribution across stripes over time.
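One cheap way to decouple stripe assignment from allocation order is to run
the ID through a mixing function before taking the modulus. As a sketch (the
finalizer below is the 64-bit fmix step from MurmurHash3; the class and method
names are illustrative):

```java
public class StripeHash {
    // 64-bit finalizer ("fmix64") from MurmurHash3, used here purely
    // as a mixing step so stripe choice doesn't track allocation order.
    static long mix64(long z) {
        z ^= z >>> 33;
        z *= 0xff51afd7ed558ccdL;
        z ^= z >>> 33;
        z *= 0xc4ceb9fe1a85ec53L;
        z ^= z >>> 33;
        return z;
    }

    static int stripe(long blockId, int numStripes) {
        // Math.floorMod keeps the result non-negative even when the
        // truncated hash is negative.
        return Math.floorMod((int) mix64(blockId), numStripes);
    }
}
```

With this, a run of sequentially allocated (or sequentially deleted) block IDs
still spreads across stripes.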
bq. We will use the “flyweight pattern” to permit access to the offheap memory
by wrapping it with a short-lived Java object containing appropriate accessors
and mutators.
It will be interesting to measure the impact on GC and CPU, if any.
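For concreteness, I imagine the flyweight looking something like the sketch
below (the record layout and all names are hypothetical, not from the design
doc): a single reusable cursor that decodes fields out of an off-heap buffer
on demand, so no per-block Java object lives on the heap.

```java
import java.nio.ByteBuffer;

public class BlockCursor {
    // Illustrative fixed-width layout: blockId (8 bytes) + numBytes (8 bytes).
    static final int RECORD_SIZE = 16;

    private final ByteBuffer buf; // e.g. ByteBuffer.allocateDirect(...)
    private int offset;

    BlockCursor(ByteBuffer buf) {
        this.buf = buf;
    }

    // Re-point the same cursor at another record instead of allocating
    // a new wrapper object; this is the short-lived/flyweight part.
    BlockCursor seek(int recordIndex) {
        this.offset = recordIndex * RECORD_SIZE;
        return this;
    }

    long blockId()  { return buf.getLong(offset); }
    long numBytes() { return buf.getLong(offset + 8); }

    void setNumBytes(long n) { buf.putLong(offset + 8, n); }
}
```

The open question is whether the escape analysis / scalar replacement in the
JIT keeps these wrappers off the heap in practice, which is why the GC and CPU
measurements will be interesting.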
> BlockManager Scalability Improvements
> -------------------------------------
>
> Key: HDFS-7836
> URL: https://issues.apache.org/jira/browse/HDFS-7836
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Charles Lamb
> Assignee: Charles Lamb
> Attachments: BlockManagerScalabilityImprovementsDesign.pdf
>
>
> Improvements to BlockManager scalability.