[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343695#comment-14343695 ]

Colin Patrick McCabe commented on HDFS-7836:
--------------------------------------------

bq. Arpit said: We do use one RPC per storage when the block count is over 1M 
viz. DFS_BLOCKREPORT_SPLIT_THRESHOLD_DEFAULT. The math doesn't work since 
protobuf uses vint on the wire. 9M blocks ~ 64MB was seen empirically in a 
couple of different deployments. It was used as the basis for the default of 1M.

Ah, thanks for that.  You are right that my math was off, because I was not 
considering how vints were encoded.
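To make the vint point concrete, here is a quick standalone sketch (not HDFS 
code; the field values below are made up) of protobuf's varint sizing, which 
is exactly where a fixed 8-bytes-per-long estimate goes wrong:

{code:java}
// A minimal sketch (not HDFS code) of protobuf varint sizing: a uint64 field
// costs 1-10 bytes on the wire depending on its value, not a fixed 8 bytes.
public class VarintSize {
  // Bytes protobuf needs to encode a non-negative long as a varint:
  // 7 payload bits per byte, with the high bit used as a continuation flag.
  static int varintSize(long v) {
    int n = 1;
    while ((v >>>= 7) != 0) {
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    // Illustrative values only: a block id near 2^30, a generation stamp,
    // and a block length. Real per-block cost depends on actual magnitudes.
    System.out.println(varintSize(1_073_741_825L)); // 5 bytes
    System.out.println(varintSize(1_000_001L));     // 3 bytes
    System.out.println(varintSize(134_217_728L));   // 4 bytes
  }
}
{code}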

I was not aware that we currently have the ability to split full block reports 
(FBRs) into multiple RPCs.  I knew that we were releasing the lock after 
processing each storage, but I didn't realize we could also send a separate 
RPC per storage.  I'll have to look into that.  It will certainly help a lot 
with reducing RPC size.  One interesting question is how much testing the 
split code has received; I think DNs with over 1 million blocks are still 
rare.
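For reference, here is my understanding of the split decision as a hedged 
sketch; the names below (blockReport, sendReportRpc, the per-storage map) are 
placeholders rather than the actual DataNode code, and the threshold is the 
dfs.blockreport.split.threshold default mentioned above:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the FBR split decision as I understand it; this is a
// paraphrase, not the real DataNode implementation.
public class FbrSplitSketch {
  // Default for dfs.blockreport.split.threshold
  // (DFS_BLOCKREPORT_SPLIT_THRESHOLD_DEFAULT).
  static final long SPLIT_THRESHOLD = 1_000_000L;

  static void blockReport(Map<String, Long> blocksPerStorage) {
    long totalBlocks = 0;
    for (long n : blocksPerStorage.values()) {
      totalBlocks += n;
    }
    if (totalBlocks <= SPLIT_THRESHOLD) {
      // Small report: all storages go to the NameNode in a single RPC.
      sendReportRpc(blocksPerStorage.keySet());
    } else {
      // Large report: one RPC per storage, so no single RPC carries
      // the entire report.
      for (String storageId : blocksPerStorage.keySet()) {
        sendReportRpc(java.util.Collections.singleton(storageId));
      }
    }
  }

  static void sendReportRpc(java.util.Collection<String> storages) {
    System.out.println("FBR RPC covering storages: " + storages);
  }

  public static void main(String[] args) {
    Map<String, Long> dn = new LinkedHashMap<>();
    dn.put("DS-1", 600_000L);
    dn.put("DS-2", 700_000L);
    blockReport(dn); // total 1.3M > 1M, so two RPCs go out
  }
}
{code}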

bq. Arpit wrote: With sequential allocation, a job that does 'create N files, 
delete M files, repeat' could cause that imbalance over time.

Maybe I am missing something, but I don't see a realistic scenario where this 
could happen.  MR jobs are pretty much all multi-threaded (unless you are 
running a single-node test cluster), and block allocations are not going to be 
"orderly"... the noise in the system, such as random RPC delays, locking 
delays, and so forth, will ensure that we don't see a particular mapper or 
reducer get all blocks mod 5 (or any other simple pattern like that).  Of 
course, the cost of a more sophisticated hash function might be low enough 
that we should just do it anyway.
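To illustrate the kind of cheap mixing I have in mind: something like 
MurmurHash3's 64-bit finalizer (an illustrative choice on my part, not a 
concrete proposal) breaks up any mod-N striping from sequential IDs for the 
cost of a few multiplies and shifts:

{code:java}
// Illustrative only: mixing a sequential block id through MurmurHash3's
// fmix64 finalizer before bucketing removes simple "id mod N" patterns.
public class BlockIdMix {
  static long fmix64(long k) {
    k ^= k >>> 33;
    k *= 0xff51afd7ed558ccdL;
    k ^= k >>> 33;
    k *= 0xc4ceb9fe1a85ec53L;
    k ^= k >>> 33;
    return k;
  }

  public static void main(String[] args) {
    long buckets = 5;
    for (long id = 1; id <= 10; id++) {
      long naive = id % buckets;                      // stripes 1,2,3,4,0,...
      long mixed = Math.floorMod(fmix64(id), buckets); // effectively uniform
      System.out.printf("id=%d naive=%d mixed=%d%n", id, naive, mixed);
    }
  }
}
{code}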

bq. Daryn said: Colin, please take a look and provide feedback on the doc I 
linked on HDFS-6658. It's been a multi-month effort to find a performant 
implementation. It's smaller in scope than the design here, but very similar in 
a number of ways.

In general, I didn't realize that HDFS-6658 was still being actively worked 
on, since it was dormant (no comments) for 5 months until last Friday or so.  
I did review it earlier, but I had some concerns about running out of datanode 
indices.  I think those could be addressed (and might have been in the latest 
version), but only at the cost of adding a garbage-collection-style system.  
Also, as you mention, the scope is a lot smaller than this design; HDFS-6658 
is a much more short-term solution.

[~clamb] is going to be around on 3/10, 3/11, and 3/12... do any of those 
days work for you for a meetup?  Perhaps we can combine our efforts, if that 
turns out to make sense.

> BlockManager Scalability Improvements
> -------------------------------------
>
>                 Key: HDFS-7836
>                 URL: https://issues.apache.org/jira/browse/HDFS-7836
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Charles Lamb
>            Assignee: Charles Lamb
>         Attachments: BlockManagerScalabilityImprovementsDesign.pdf
>
>
> Improvements to BlockManager scalability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)