[
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343695#comment-14343695
]
Colin Patrick McCabe commented on HDFS-7836:
--------------------------------------------
bq. Arpit said: We do use one RPC per storage when the block count is over 1M
viz. DFS_BLOCKREPORT_SPLIT_THRESHOLD_DEFAULT. The math doesn't work since
protobuf uses vint on the wire. 9M blocks ~ 64MB was seen empirically in a
couple of different deployments. It was used as the basis for the default of 1M.
Ah, thanks for that. You are right that my math was off, because I was not
considering how vints were encoded.
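Just to spell out why that changes the math, here is a rough back-of-the-envelope sketch (not HDFS code; the block ID, length, and generation stamp values below are made up) of how varint encoding shrinks each block-report entry compared to a fixed 8 bytes per long:
{code:java}
// Rough sketch only: protobuf encodes each long as a varint, 7 payload
// bits per byte, so smaller values take far fewer than 8 bytes each.
public class VarintSizeEstimate {
  static int varintSize(long v) {
    int bytes = 1;
    while ((v & ~0x7FL) != 0) {
      bytes++;
      v >>>= 7;
    }
    return bytes;
  }

  public static void main(String[] args) {
    // Example values only (hypothetical block): an ID near 2^30, a 128MB
    // block length, and a small generation stamp.
    long blockId = 1_073_741_900L;
    long numBytes = 134_217_728L;
    long genStamp = 10_001L;
    int perBlock = varintSize(blockId) + varintSize(numBytes) + varintSize(genStamp);
    // Prints 11 with these values, versus 24 if each long took a fixed 8 bytes.
    System.out.println("approx bytes per block entry: " + perBlock);
  }
}
{code}
With values in that range each entry comes out around 11 bytes rather than 24, so a fixed-width estimate overstates the report size by roughly a factor of two.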
I was not aware that we currently have the ability to split full block reports (FBRs)
into multiple RPCs. I knew that we were releasing the lock after processing each
storage, but I didn't realize we could also send separate RPCs. I'll have to look
into that. It will certainly help a lot when it comes to reducing RPC size. One
interesting question is how much testing the split code has received, since I think
DNs with over 1 million blocks are still rare.
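For anyone else catching up on that behavior, here is my rough understanding as an illustrative sketch (simplified stand-in code, not the actual BPServiceActor logic): at or below the split threshold, every storage's report goes out in a single RPC; above it, the DN falls back to one RPC per storage.
{code:java}
// Illustrative sketch only; the RPC and report types are stand-ins that
// carry just block counts rather than real block lists.
public class BlockReportSplitSketch {

  static void sendBlockReportRpc(String which, long blocks) {
    System.out.println("blockReport RPC for " + which + " (" + blocks + " blocks)");
  }

  public static void main(String[] args) {
    // dfs.blockreport.split.threshold defaults to 1,000,000 blocks.
    final long splitThreshold = 1_000_000L;

    // Hypothetical per-storage block counts on one datanode.
    long[] blocksPerStorage = { 600_000L, 700_000L };

    long totalBlocks = 0;
    for (long n : blocksPerStorage) {
      totalBlocks += n;
    }

    if (totalBlocks <= splitThreshold) {
      // Below the threshold: all storages go out in a single RPC.
      sendBlockReportRpc("all storages", totalBlocks);
    } else {
      // Above the threshold: one RPC per storage keeps each message small.
      for (int i = 0; i < blocksPerStorage.length; i++) {
        sendBlockReportRpc("storage #" + i, blocksPerStorage[i]);
      }
    }
  }
}
{code}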
bq. Arpit wrote: With sequential allocation, a job that does 'create N
files, delete M files, repeat' could cause that imbalance over time.
Maybe I am missing something, but I don't see a realistic scenario where this
could happen. MR jobs are pretty much all multi-threaded (unless you are
running a single-node test cluster) and block allocations are not going to be
"orderly"... the random noise in the system, like RPC delays, locking
delays, and so forth, will ensure that we don't see a particular mapper or
reducer get all blocks mod 5 (or any other simple pattern like that). Of
course, the cost of a more sophisticated hash function might be low enough that
we should just do it anyway.
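To illustrate the "cheap hash" point, here is a hypothetical sketch (not from the design doc) comparing a plain modulo on sequentially allocated block IDs against first mixing the ID with the MurmurHash3 fmix64 finalizer; even an adversarial "only every 5th block survives" pattern spreads out evenly after mixing, and the mixing step is only a few shifts and multiplies per ID.
{code:java}
// Hypothetical sketch: compare plain (blockId % partitions) against a
// cheap mixing hash for spreading sequentially allocated block IDs.
public class BlockIdPartitionSketch {

  // MurmurHash3 fmix64 finalizer.
  static long mix64(long k) {
    k ^= k >>> 33;
    k *= 0xff51afd7ed558ccdL;
    k ^= k >>> 33;
    k *= 0xc4ceb9fe1a85ec53L;
    k ^= k >>> 33;
    return k;
  }

  public static void main(String[] args) {
    final int partitions = 5;
    final long firstBlockId = 1L << 30;   // made-up starting point

    int[] plain = new int[partitions];
    int[] mixed = new int[partitions];

    // Adversarial pattern: keep only every 5th sequentially allocated ID,
    // as if a job created 5 files and deleted 4 of them, repeatedly.
    for (long id = firstBlockId; id < firstBlockId + 5_000_000L; id += 5) {
      plain[(int) Math.floorMod(id, (long) partitions)]++;
      mixed[(int) Math.floorMod(mix64(id), (long) partitions)]++;
    }

    // Plain modulo piles every surviving block into one partition;
    // the mixed version spreads them roughly evenly.
    System.out.println("plain: " + java.util.Arrays.toString(plain));
    System.out.println("mixed: " + java.util.Arrays.toString(mixed));
  }
}
{code}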
bq. Daryn said: Colin, please take a look and provide feedback on the doc I
linked on HDFS-6658. It's been a multi-month effort to find a performant
implementation. It's smaller in scope than the design here, but very similar in
a number of ways.
In general, I didn't realize that HDFS-6658 was still being actively worked on,
since it was dormant (no comments) for 5 months until last Friday or so. I did
review it earlier, but I had some concerns about running out of datanode
indices. I think those could be addressed (and might have been in the latest
version), but only at the cost of adding a garbage-collection-style system.
Also, as you mention, the scope is a lot smaller than this design; HDFS-6658 is a
much more short-term solution.
[~clamb] is going to be around on 3/10, 3/11, and 3/12... do any of those
days work for you for having a meetup? Perhaps we can combine our efforts, if
that turns out to make sense.
> BlockManager Scalability Improvements
> -------------------------------------
>
> Key: HDFS-7836
> URL: https://issues.apache.org/jira/browse/HDFS-7836
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Charles Lamb
> Assignee: Charles Lamb
> Attachments: BlockManagerScalabilityImprovementsDesign.pdf
>
>
> Improvements to BlockManager scalability.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)