[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-04-30 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522480#comment-14522480
 ] 

Colin Patrick McCabe commented on HDFS-7836:


Hi [~xinwei],

The discussion on March 11th focused on our February 24th proposal for 
off-heaping and parallelizing the block manager.  We spent a lot of time going 
through the proposal and responding to questions about it.

There was widespread agreement that we needed to reduce the garbage collection 
impact of the millions of BlockInfoContiguous structures.  There was some 
disagreement about how to do that.  Daryn argued that using large primitive 
arrays was the best way to go.  Charles and I argued that using off-heap 
storage was better.

The main advantage of large primitive arrays is that they make the existing 
Java \-Xmx memory settings work as expected.  The main advantage of off-heap is 
that 
it allows the use of things like {{Unsafe#compareAndSwap}}, which can often 
lead to more efficient concurrent data structures.  Also, when using off-heap 
memory, we get to re-use malloc rather than essentially writing our own malloc 
for every subsystem.

There was some hand-wringing about off-heap memory being slower, but I do not 
believe that this is valid.  Apache Spark has found that their off-heap hash 
table was actually faster than the on-heap one, due to the ability to better 
control the memory layout.  
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
  The key is to avoid using {{DirectByteBuffer}}, which is rather slow, and use 
{{Unsafe}} instead.
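
To make the {{Unsafe}} point concrete, here is a minimal sketch (not code from 
any branch; the class and all names in it are invented for illustration) of 
allocating a native slab and doing a lock-free compare-and-swap against it:

{code:java}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Minimal sketch: allocate a native slab and CAS a 64-bit slot in place,
// with no DirectByteBuffer and no on-heap object per entry.
public class OffHeapCasExample {
  private static final Unsafe UNSAFE = getUnsafe();

  private static Unsafe getUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    long slab = UNSAFE.allocateMemory(8 * 1024);  // 1024 eight-byte slots
    UNSAFE.setMemory(slab, 8 * 1024, (byte) 0);   // zero the slab
    // Lock-free update of slot 0: succeeds only if it still holds 0.
    boolean swapped = UNSAFE.compareAndSwapLong(null, slab, 0L, 42L);
    System.out.println("swapped=" + swapped + " value=" + UNSAFE.getLong(slab));
    UNSAFE.freeMemory(slab);
  }
}
{code}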

However, Daryn has posted some patches using the large arrays approach.  
Since they are a nice incremental improvement, we are probably going to pick 
them up if there are no blockers.  We are also looking at incremental 
improvements such as implementing backpressure for full block reports, and 
speeding up edit log replay (if possible).  I would also like to look at 
parallelizing the full block report... if we can do that, we can get a dramatic 
improvement in FBR times by using more than 1 core.

 BlockManager Scalability Improvements
 -

 Key: HDFS-7836
 URL: https://issues.apache.org/jira/browse/HDFS-7836
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Charles Lamb
Assignee: Charles Lamb
 Attachments: BlockManagerScalabilityImprovementsDesign.pdf


 Improvements to BlockManager scalability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-04-29 Thread Xinwei Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518976#comment-14518976
 ] 

Xinwei Qin  commented on HDFS-7836:
---

Hi [~cmccabe], [~clamb], this is a very meaningful improvement. Is there any 
update or next plan for this JIRA? Could you post a summary of the meeting 
held on March 11th?



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-09 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353386#comment-14353386
 ] 

Charles Lamb commented on HDFS-7836:


JOIN WEBEX MEETING
https://cloudera.webex.com/join/clamb  |  622 867 972


JOIN BY PHONE
1-650-479-3208 Call-in toll number (US/Canada)
Access code: 622 867 972

Global call-in numbers:
https://cloudera.webex.com/cloudera/globalcallin.php?serviceType=MC&ED=342142257&tollFree=0


Can't join the meeting? Contact support here:
https://cloudera.webex.com/mc

IMPORTANT NOTICE: Please note that this WebEx service allows audio and other 
information sent during the session to be recorded, which may be discoverable 
in a legal matter. By joining this session, you automatically consent to such 
recordings. If you do not consent to being recorded, discuss your concerns with 
the host or do not join the session.




[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-09 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353455#comment-14353455
 ] 

Charles Lamb commented on HDFS-7836:


If you are planning on attending the meeting in person, please drop me an email 
so I have an idea of how large a conference room to book.

Thanks.




[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-06 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350878#comment-14350878
 ] 

Colin Patrick McCabe commented on HDFS-7836:


I don't think the block map size is really that easy to get at right now.  
Taking a heap dump on a big NN can take minutes... it's not something most 
sysadmins will let you do.  And the analysis is difficult... a lot of common 
heap analysis tools require tons of memory.

Anyway, we should probably add a JMX counter for the size(s) of the block map 
hash tables, and the number of entries, for tracking purposes.
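
A sketch of what such a counter could look like (the bean name, interface, and 
fields below are hypothetical, not an existing HDFS metric):

{code:java}
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// Hypothetical MXBean exposing block map sizes for tracking over time.
interface BlockMapStatsMXBean {
  long getHashTableSizeBytes();
  long getNumEntries();
}

class BlockMapStats implements BlockMapStatsMXBean {
  private volatile long tableSizeBytes;
  private volatile long numEntries;

  @Override public long getHashTableSizeBytes() { return tableSizeBytes; }
  @Override public long getNumEntries() { return numEntries; }

  // Register under the NameNode's JMX tree; the object name is invented.
  static void register(BlockMapStats stats) throws Exception {
    ManagementFactory.getPlatformMBeanServer().registerMBean(
        stats, new ObjectName("Hadoop:service=NameNode,name=BlockMapStats"));
  }
}
{code}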



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-05 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349438#comment-14349438
 ] 

Charles Lamb commented on HDFS-7836:


We'll hold a design review meeting and discussion of this project next Weds, 
March 11th, 10am to 1pm (PDT) at the Cloudera offices in Palo Alto. I'll post 
webex information on this Jira before then. If you plan on attending in person, 
please send me a private email so I know how many people to expect.




[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349599#comment-14349599
 ] 

Chris Nauroth commented on HDFS-7836:
-

The thing we're losing is the ability to measure memory consumption by the 
block map specifically.  The OS and standard JMX metrics are great for 
measuring the whole process in aggregate, but it would be nice to add another 
counter to compensate for the measurement we'd be losing.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347999#comment-14347999
 ] 

Colin Patrick McCabe commented on HDFS-7836:


I think it makes sense to have total heapsize in a JMX counter, if it's not 
already there somewhere.  It's a pretty easy number to get from the OS, 
although there are confounding factors like shared libraries and shared memory 
segments.  But in general, those should be minor contributors.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-04 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348237#comment-14348237
 ] 

stack commented on HDFS-7836:
-

The JVM by default publishes off-heap usage via JMX in the 
{{NonHeapMemoryUsage}} attribute of the {{java.lang:type=Memory}} bean. It has 
committed/max/used. If that is insufficient, we could figure out how to expose 
more, for sure.
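
For reference, a minimal sketch of reading that bean from inside the process 
(note the JVM's non-heap pools cover things like the code cache and class 
metadata, so this may not account for memory the NN allocates itself via 
malloc/{{Unsafe}}):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class NonHeapUsagePrinter {
  public static void main(String[] args) {
    // Same data the java.lang:type=Memory bean exposes over JMX.
    MemoryUsage u = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
    System.out.printf("committed=%d max=%d used=%d%n",
        u.getCommitted(), u.getMax(), u.getUsed());
  }
}
{code}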



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-03 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345012#comment-14345012
 ] 

Charles Lamb commented on HDFS-7836:


Yes, there would definitely be a webex available.




[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-03 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344735#comment-14344735
 ] 

Yi Liu commented on HDFS-7836:
--

Good proposal, Colin and Charles. 

{quote}
Reduce lock contention for the FSNamesystem lock
Enable concurrent processing of block reports
{quote}

Maybe in the first phase, we can separate the locks for block reports from the 
FSN lock. Even so, it's a good improvement and involves a bulk of work. Then we 
can do a further improvement to have multiple locks (stripes) for block 
reports.

{quote}
Use offheap to reduce the Java heap size of the NameNode
{quote}
Generally we have a few ways to use off-heap memory: 1) DirectByteBuffer, which 
does not fit our case; 2) Sun Unsafe; 3) writing native code and allocating 
memory ourselves. It seems we are going to use #3?

{quote}
We probably want to support dynamically growing the hash table, to avoid 
putting too much of a
burden on the administrator to configure the size
{quote}

If so, we still need to have an initial capacity for the hash table, otherwise 
there will be lots of rehashing when the load factor reaches the threshold.
How about we do the same thing we currently do in the Java blocksMap and use 2% 
of total memory? Then we don't need the load factor and can assume the number 
of table rows is enough?




[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345482#comment-14345482
 ] 

Colin Patrick McCabe commented on HDFS-7836:


bq. Generally we have a few ways to use off-heap memory: 1) DirectByteBuffer, 
which does not fit our case; 2) Sun Unsafe; 3) writing native code and 
allocating memory ourselves. It seems we are going to use #3?

I think #2 is the best.  As we talked about earlier, writing more JNI (#3) just 
adds more platform dependencies that we don't want.  With regard to #1, 
allocating DirectByteBuffers is very slow.

bq. If so, we still need to have an initial capacity for the hash table, 
otherwise there will be lots of rehashing when the load factor reaches the 
threshold. How about we do the same thing we currently do in the Java blocksMap 
and use 2% of total memory? Then we don't need the load factor and can assume 
the number of table rows is enough?

That's an interesting idea.  Keep in mind, though, that this hash table will be 
off-heap.  So if we size the hash table based on the JVM heap size, it would be 
weird.  Honestly I think having a setting for this is easiest.  I also suspect 
that growing the off-heap hash table will be quicker than growing an on-heap 
hash table, since it won't trigger full GCs during resizing.
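
A rough sketch of why the off-heap resize is GC-friendly, assuming an 
{{Unsafe}}-backed table (the names and doubling policy are invented for 
illustration):

{code:java}
import sun.misc.Unsafe;

// Growth is a realloc of native memory plus a rehash; no large Java array
// is ever allocated, so resizing does not trigger a full GC.
public class OffHeapTable {
  private static final int SLOT_BYTES = 8;
  private final Unsafe unsafe;
  private long base;    // address of the slot array
  private long slots;   // current capacity in slots

  OffHeapTable(Unsafe unsafe, long initialSlots) {
    this.unsafe = unsafe;
    this.slots = initialSlots;
    this.base = unsafe.allocateMemory(initialSlots * SLOT_BYTES);
    unsafe.setMemory(base, initialSlots * SLOT_BYTES, (byte) 0);
  }

  void grow() {
    long newSlots = slots * 2;
    base = unsafe.reallocateMemory(base, newSlots * SLOT_BYTES);
    unsafe.setMemory(base + slots * SLOT_BYTES, slots * SLOT_BYTES, (byte) 0);
    slots = newSlots;
    // A real implementation would rehash existing entries into the new
    // region here; this sketch only shows the allocation side.
  }
}
{code}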



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345475#comment-14345475
 ] 

Colin Patrick McCabe commented on HDFS-7836:


Thanks, Yi.  We'll definitely provide a webex.

I think moving the BRs out from under the FSN lock is more work than it seems, 
and difficult to do incrementally.  There are a lot of cases where we access 
blocks in FSNamesystem, assuming that the FSN lock is sufficient protection 
(or, to be pedantically correct, the FSDirectory lock in really recent 
versions).  Sure, we could have a separate BlockManager lock and take that, but 
then we'd have to wait for that lock, while holding the FSN lock, which would 
be disastrous for performance.  The locking change is going to have to be done 
as part of the sharding and other refactors in order to be beneficial.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-03 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345610#comment-14345610
 ] 

Chris Nauroth commented on HDFS-7836:
-

I have one additional operability concern.  With off-heaping, we're going to 
lose the ability to measure the size of the block map with well-known tools 
like {{jmap}}.  Do you have any thoughts on compensating for this?  Maybe we 
could publish new JMX counters with the off-heap size stats?



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-03 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344742#comment-14344742
 ] 

Yi Liu commented on HDFS-7836:
--

{quote}
Charles Lamb is going to be around on 3/10, 3/11, and 3/12... does any of those 
days work for you for having a meetup? Perhaps we can combine our efforts, if 
that turns out to make sense.
{quote}
Could you schedule webex or lync too? Then people like me can join remotely if 
we get time :)



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-03-02 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343695#comment-14343695
 ] 

Colin Patrick McCabe commented on HDFS-7836:


bq. Arpit said: We do use one RPC per storage when the block count is over 1M 
viz. DFS_BLOCKREPORT_SPLIT_THRESHOLD_DEFAULT. The math doesn't work since 
protobuf uses vint on the wire. 9M blocks ~ 64MB was seen empirically in a 
couple of different deployments. It was used as the basis for the default of 1M.

Ah, thanks for that.  You are right that my math was off, because I was not 
considering how vints were encoded.

I was not aware that we have the ability to split full block reports (FBRs) 
into multiple RPCs at the moment.  I knew that we were releasing the lock after 
processing each storage, but I didn't realize the ability existed to do 
separate RPCs as well.  I'll have to look into that.  It certainly will help 
out a lot when it comes to reducing RPC size.  One interesting question is, how 
much testing has the split code received?  I think DNs with over 1 million 
blocks are still rare.

bq. Arpit wrote: With sequential allocation, a job that does 'create N files, 
delete M files, repeat' could cause that imbalance over time.

Maybe I am missing something, but I don't see a realistic scenario where this 
could happen.  MR jobs are pretty much all multi-threaded (unless you are 
running a single node test cluster) and block allocations are not going to be 
orderly... the random noise in the system like random RPC delays, locking 
delays, and so forth will ensure that we don't see a particular mapper or 
reducer get all blocks mod 5 (or any other simple pattern like that).  Of 
course the cost of a more sophisticated hash function might be low enough that 
we should just do it anyway.

bq. Daryn said: Colin, please take a look and provide feedback on the doc I 
linked on HDFS-6658. It's been a multi-month effort to find a performant 
implementation. It's smaller in scope than the design here, but very similar in 
a number of ways.

In general, I didn't realize that HDFS-6658 was still being actively worked on 
since it was dormant (no comments) for 5 months until last Friday or so.  I did 
review it earlier but I had some concerns about running out of datanode 
indices.  I think those could be addressed (and might have been in the latest 
version), but only at the cost of adding a garbage-collection-style system.  
Also, as you mention, the scope is a lot smaller than this-- HDFS-6658 is a 
much more short-term solution.

[~clamb] is going to be around on 3/10, 3/11, and 3/12... does any of those 
days work for you for having a meetup?  Perhaps we can combine our efforts, if 
that turns out to make sense.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-27 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340474#comment-14340474
 ] 

Daryn Sharp commented on HDFS-7836:
---

Colin, please take a look and provide feedback on the doc I linked on 
HDFS-6658.  It's been a multi-month effort to find a performant implementation. 
 It's smaller in scope than the design here, but very similar in a number of 
ways.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-27 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341074#comment-14341074
 ] 

Arpit Agarwal commented on HDFS-7836:
-

bq. 1 M blocks per disk, on 10 disks, and 24 bytes per block, is a 240 MB block 
report (did I do that math right?) That's definitely bigger than we'd like the 
full BR RPC to be, and compression can help here. Or possibly separating the 
block report into multiple RPCs. Perhaps one RPC per storage?
We do use one RPC per storage when the block count is over 1M viz. 
{{DFS_BLOCKREPORT_SPLIT_THRESHOLD_DEFAULT}}. The math doesn't work since 
protobuf uses vint on the wire. _9M blocks ~ 64MB_ was seen empirically in a 
couple of different deployments. It was used as the basis for the default of 1M.

bq. Hmm. Our sequential block allocations should guarantee that mod N produces 
an approximately equal number of blocks in each stripe. It is only with 
randomly allocated block IDs that we could even theoretically get an imbalance 
(although the probability is vanishingly small even there if the randomness is 
uniform.). With sequentially allocated block IDs the stripes will always be of 
equal size. I guess deletions of blocks could change that, but I see no reason 
why any group of blocks mod N should be more deleted than another group.
With sequential allocation, a job that does 'create N files, delete M files, 
repeat' could cause that imbalance over time.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338394#comment-14338394
 ] 

Charles Lamb commented on HDFS-7836:


Hi [~arpit99],

Thanks for reading over the design doc and commenting on it.

bq. The DataNode can now split block reports per storage directory (post 
HDFS-2832), controlled by DFS_BLOCKREPORT_SPLIT_THRESHOLD_KEY. Did you get a 
chance to try it out and see if it helps? Splitting reports addresses all of 
the above. (edit: does not address network bandwidth gains from compression 
though)

I think you may mean your work on HDFS-5153, right? If I understand that 
correctly, it sends one report per storage. We have seen block reports in the 
100MB+ size range, so we suspect that an even smaller chunk size than a storage 
may yield benefits. That said, I am also watching [~daryn]'s work on HDFS-7435, 
which addresses a lot of this piece of this Jira's proposal. I think that once 
HDFS-7435 is committed, we will make some measurements and see if anything else 
in the area of chunking is necessary. As you point out, compression should also 
help.

bq. Do you have any estimates for startup time overhead due to GCs?

We know of at least one large deployment which experiences a full GC pause 
during startup. I'm not sure of the time, but in general, the off-heaping will 
help with NN throughput just by reducing the number of objects on the heap.

bq. How does this affect block report processing? We cannot assume DataNodes 
will sort blocks by target stripe. Will the NameNode sort received reports or 
will it acquire+release a lock per block? If the former, then there should 
probably be some randomization of order across threads to avoid unintended 
serialization e.g. lock convoys.

The idea is that currently, processing a block report requires taking the FSN 
lock. So this proposal is two-part. First, use better locking semantics so that 
we don't have to take the FSN lock. Next, shard the blocksMap structure so that 
multiple threads can operate concurrently on it. Even if we continue to process 
BRs under one big happy FSN lock, having multiple threads operate concurrently 
will yield benefits. The sharding (stripes) is along arbitrary boundaries. For 
instance, the design doc suggests that it could be striped by doing blockId % 
nStripes. nStripes would be configurable to a relatively small number (the 
design doc suggests 4 to 16), and if the modulo calculation is used, then 
nStripes would be a prime that is roughly equal to the number of threads 
available. As long as block report processing per block does not need to access 
more than one shard at a time, this will be fine -- multiple threads can 
process blocks in parallel (see the sketch below). It is a technique that 
Berkeley DB Java Edition uses for its lock table to improve concurrency.
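
A rough sketch of the striping described above (illustrative only, not branch 
code):

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Pick a stripe by blockId mod nStripes and lock just that stripe, so
// threads touching different stripes can run concurrently.
public class StripedBlocksMap {
  private final int nStripes;             // small prime, e.g. 7 or 13
  private final ReentrantLock[] locks;

  public StripedBlocksMap(int nStripes) {
    this.nStripes = nStripes;
    this.locks = new ReentrantLock[nStripes];
    for (int i = 0; i < nStripes; i++) {
      locks[i] = new ReentrantLock();
    }
  }

  private int stripeOf(long blockId) {
    return (int) Math.floorMod(blockId, (long) nStripes);
  }

  public void processBlock(long blockId, Runnable update) {
    ReentrantLock lock = locks[stripeOf(blockId)];
    lock.lock();
    try {
      update.run();   // mutate only this stripe's shard of the map
    } finally {
      lock.unlock();
    }
  }
}
{code}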


[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339007#comment-14339007
 ] 

Chris Nauroth commented on HDFS-7836:
-

bq. With regard to compatibility... even if the NN max heap size is unchanged, 
the NN will not actually consume that memory until it needs it, right? So 
existing configuration should work just fine. I guess the one exception might 
be if you set a super-huge minimum (not maximum) JVM heap size.

Yes, that's a correct description of the behavior, but I typically deploy with 
{{-Xms}} and {{-Xmx}} configured to the same value.  In my experience, this has 
been a more stable configuration for long-running Java processes with a large 
heap like the NameNode.  It can prevent unpredictable costly allocations later 
in the process lifetime, when they aren't expected.  Also, there is no 
guarantee that there would be sufficient memory to satisfy that allocation 
later in the process lifetime, which would result in the process terminating.  
You could also run afoul of the OOM killer after weeks of a seemingly stable 
run.  The JVM never frees the memory it allocates when the heap grows, so you 
end up needing to plan for the worst case anyway.  I prefer the fail fast 
behavior of trying to take all the memory at JVM process startup.  Of course, 
this circumvents any opportunity to run with a smaller memory footprint than 
the max, but I find that's generally the right trade-off for servers where it's 
typical to run only a very well-defined set of processes on the box and plan 
strict resource allocations for each one.

bq. There are a few big users who can shave several minutes off of their NN 
startup time by setting the NN minimum heap size to something large. This 
prevents successive rounds of stop the world GC where we copy everything to a 
new, 2x as large heap.

That's another example of why setting {{-Xms}} equal to {{-Xmx}} can be helpful.

These are the kinds of configurations where I have a potential compatibility 
concern.  If someone is running multiple daemons on the box, and they have 
carefully carved out exact heap allocations by setting {{-Xms}} equal to 
{{-Xmx}} on each process, then moving data off-heap leaves their eager large 
heap allocation unused and potentially exceeds total available RAM.

bq. As a practical matter, we have found that everyone who has a big NN heap 
has a big NN machine, which has much more memory than we can even use 
currently. So I would not expect this to be a problem in practice.

I'll dig into this more on my side too and try to get more details on whether 
or not this really could be a likely problem in practice.  I'd be curious to 
hear from others in the community too.  (There are a lot of ways to run 
operations for a Hadoop cluster, sometimes with differing opinions on 
configuration best practices.)



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338901#comment-14338901
 ] 

Charles Lamb commented on HDFS-7836:


bq. Are you proposing that off-heaping is an opt-in feature that must be 
explicitly enabled in configuration, or are you proposing that off-heaping will 
be the new default behavior? Arguably, jumping to off-heaping as the default 
could be seen as a backwards-incompatibility, because it might be unsafe to 
deploy the feature without simultaneous down-tuning the NameNode max heap size. 
Some might see that as backwards-incompatible with existing configurations.

The proposal is to have an option that lets the offheap code allocate slabs 
using 'new byte[]' rather than malloc. This would be used for debugging 
purposes and not in a normal deployment.
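
A hypothetical sketch of that option (all names invented): slab allocation goes 
through an interface, so a debug configuration can substitute plain on-heap 
{{new byte[]}} slabs that heap-dump tools can still see. A real off-heap 
implementation would obtain its slabs from malloc via {{Unsafe}}:

{code:java}
import java.nio.ByteBuffer;

// Route slab allocation through an interface so debug builds can swap in
// GC-visible storage.
interface SlabAllocator {
  ByteBuffer allocateSlab(int bytes);
}

class OnHeapDebugAllocator implements SlabAllocator {
  @Override public ByteBuffer allocateSlab(int bytes) {
    return ByteBuffer.wrap(new byte[bytes]);  // GC-managed, visible to jmap
  }
}
{code}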




[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338926#comment-14338926
 ] 

Colin Patrick McCabe commented on HDFS-7836:


Thanks for taking a look at this, [~cnauroth] and [~arpitagarwal].

bq. Chris wrote: Do you intend to enforce an upper limit on growth of the 
off-heap allocation? If so, do you see this as a new configuration property or 
as a function of an existing parameter (i.e. equal to max heap with the 
consideration that the block map takes ~50% of the heap now)? -Xmx alone will 
no longer be sufficient to define a ceiling for RAM utilization by the NameNode 
process. This can be important in deployments that choose to co-locate other 
Hadoop daemons on the same hosts as the NameNode.

That's an interesting question.  It is certainly possible to put an upper limit 
on the growth of the total process heap (Java + non-Java) by using {{ulimit 
\-m}}, on any UNIX-like OS.  I'm not sure that I would recommend this 
configuration, since the behavior for when the ulimit is exceeded is that the 
process is terminated.  I think that it would be better in most cases to have 
the management software running on the cluster examine heap sizes, and warn 
when memory is getting low.

bq. Can I take this to mean that there will be no new native code written as 
part of this project? Of course, we can always do a native code implementation 
later if use of a private Sun API becomes problematic, but I wanted to 
understand the code footprint for the current proposal. Avoiding native code 
entirely would be nice, because it reduces the scope of testing efforts across 
multiple platforms.

Yeah, we are hoping to avoid writing any JNI code here.  So far, it looks good 
on that front.

bq. Are you proposing that off-heaping is an opt-in feature that must be 
explicitly enabled in configuration, or are you proposing that off-heaping will 
be the new default behavior? Arguably, jumping to off-heaping as the default 
could be seen as a backwards-incompatibility, because it might be unsafe to 
deploy the feature without simultaneous down-tuning the NameNode max heap size. 
Some might see that as backwards-incompatible with existing configurations.

I think off-heaping should be on by default.  HDFS gets enough bad press from 
having short-circuit and other optimizations turned off by default... we should 
be a little nicer this time :)

With regard to compatibility... even if the NN max heap size is unchanged, the 
NN will not actually consume that memory until it needs it, right?  So existing 
configuration should work just fine.  I guess the one exception might be if you 
set a super-huge *minimum* (not maximum) JVM heap size.  But even in that case, 
I would expect things to work, since virtual memory would pick up the slack.  
Unless you have turned off swapping, but that's not recommended.  As a 
practical matter, we have found that everyone who has a big NN heap has a big 
NN machine, which has much more memory than we can even use currently.  So I 
would not expect this to be a problem in practice.

bq. Arpit asked: Do you have any estimates for startup time overhead due to GCs?

There are a few big users who can shave several minutes off of their NN startup 
time by setting the NN minimum heap size to something large.  This prevents 
successive rounds of stop the world GC where we copy everything to a new, 2x as 
large heap.  We will get some before and after startup numbers later that 
should illustrate this even more clearly.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338865#comment-14338865
 ] 

Chris Nauroth commented on HDFS-7836:
-

+1 for the proposal overall.  Thanks for the great write-up, Charles and Colin. 
 I have questions on a few details:

bq. The Java heap size will be reduced, because the BlockManager data will be 
off-heap.

Do you intend to enforce an upper limit on growth of the off-heap allocation?  
If so, do you see this as a new configuration property or as a function of an 
existing parameter (i.e. equal to max heap with the consideration that the 
block map takes ~50% of the heap now)?  {{-Xmx}} alone will no longer be 
sufficient to define a ceiling for RAM utilization by the NameNode process.  
This can be important in deployments that choose to co-locate other Hadoop 
daemons on the same hosts as the NameNode.

bq. malloc is also accessible without the use of JNI via the Unsafe package.

Can I take this to mean that there will be no new native code written as part 
of this project?  Of course, we can always do a native code implementation 
later if use of a private Sun API becomes problematic, but I wanted to 
understand the code footprint for the current proposal.  Avoiding native code 
entirely would be nice, because it reduces the scope of testing efforts across 
multiple platforms.

bq. The off-heaping code should have the option to use on-heap memory.

Are you proposing that off-heaping is an opt-in feature that must be explicitly 
enabled in configuration, or are you proposing that off-heaping will be the new 
default behavior?  Arguably, jumping to off-heaping as the default could be 
seen as a backwards-incompatibility, because it might be unsafe to deploy the 
feature without simultaneous down-tuning the NameNode max heap size.  Some 
might see that as backwards-incompatible with existing configurations.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339258#comment-14339258
 ] 

Colin Patrick McCabe commented on HDFS-7836:


Thanks, Chris.  I agree that \-Xmx == \-Xms is often a useful configuration.  
It would be interesting to get more information about how much change in those 
settings will be necessary in general with off-heap.  We'll have to think about 
that.

I created a branch for this work, HDFS-7836.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339353#comment-14339353
 ] 

Arpit Agarwal commented on HDFS-7836:
-

Thanks for the responses Charles and Colin.

bq. We have seen block reports in the 100MB+ sizes so we suspect that an even 
small chunksize than a storage may yield benefits. 
IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with a 
hypothetical 10TB disk and a low average block size of 10MB, we get 1M 
blocks/disk in the foreseeable future. With splitting that gets you to a reasonable ~7MB 
block report per disk. I am not saying no to chunking/compression but it would 
be useful to see some perf comparison before we add that complexity.

In the past I used CreateEditsLog to generate files and a simple shell script 
to generate block files on DataNodes to simulate millions of blocks. Not as 
convenient as junit but I'll see if I can clean up and post what I used on 
HDFS-7847.

bq.  So this proposal is two part. First, use better locking semantics so that 
we don't have to take the FSN lock. 
bq. Even if we continue to process BRs under one big happy FSN lock, having 
multiple threads operate concurrently will yield benefits. 
These two sound contradictory. I assume the former is correct and we won't 
really take the FSN lock. Also I did not get how you will process one stripe at 
a time without repeatedly locking and unlocking, since DataNodes wouldn't know 
about the block to stripe mapping to order the reports. I guess I will wait to 
see the code.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339385#comment-14339385
 ] 

Charles Lamb commented on HDFS-7836:


bq. it would be useful to see some perf comparison before we add that 
complexity.

We definitely plan on getting some baseline measurements and sharing them; we 
want to know what the before and after effects are of any changes. As an 
aside, I worked on a case where we had to increase the RPC limit to 192MB in 
order to get block reports handled correctly, so I know these types of 
deployments are out there.

bq. I'll see if I can clean up and post what I used on HDFS-7847.

That would be much appreciated. I'm starting to look at HDFS-7847 (subtask of 
this Jira) and maybe that could come into play somehow.

bq. These two sound contradictory.

Yes, they do, but aren't meant to be. The first level would be to do concurrent 
processing under the FSN lock. That would at least get some parallelism. The 
second step would be to make a more lockless blocksMap which wouldn't require 
the big FSN lock to be held.

BTW, since you have edit privs, if you want to get rid of my redundant reply to 
you above, that would be great. My browser suckered me into hitting Add twice. 
Thanks.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339695#comment-14339695
 ] 

Colin Patrick McCabe commented on HDFS-7836:


bq. The hashing scheme should probably not be that simple \[as mod 5\]. Block 
IDs are sequentially allocated so it is not hard to think of pathological app 
behavior causing skewed block distribution across stripes over time.

Hmm.  Our sequential block allocations should guarantee that mod N produces an 
approximately equal number of blocks in each stripe.  It is only with randomly 
allocated block IDs that we could even theoretically get an imbalance (although 
the probability is vanishingly small even there if the randomness is uniform). 
With sequentially allocated block IDs the stripes will always be of equal 
size.  I guess deletions of blocks could change that, but I see no reason why 
any group of blocks mod N should be more deleted than another group.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-26 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339693#comment-14339693
 ] 

Colin Patrick McCabe commented on HDFS-7836:


bq. These two sound contradictory. I assume the former is correct and we won't 
really take the FSN lock. Also I did not get how you will process one stripe at 
a time without repeatedly locking and unlocking, since DataNodes wouldn't know 
about the block to stripe mapping to order the reports. I guess I will wait to 
see the code. 

Let me clarify a bit.  It may be necessary to take the FSN lock very briefly 
once at the start or end of the block report, but in general, we do not want to 
do the block report under the FSN lock as it is done currently.  The main 
processing of blocks should not happen under the FSN lock.

The idea is to separate the BlocksMap into multiple stripes.  Which blocks go 
into which stripes is determined by blockID.  Something like blockID mod 5 
would work for this.  Clearly, the incoming block report will contain work for 
several stripes.  As you mentioned, that necessitates locking and unlocking.  
But we can do multiple blocks each time we grab a stripe lock.  We simply 
accumulate a bunch of blocks that we know are in each stripe in a per-stripe 
buffer (maybe we do 1000 blocks at a time in between lock release... maybe a 
little more).  This is something that we can tune to make a tradeoff between 
latency and throughput.  This also means that we will have to be able to handle 
operations on blocks, new blocks being created, etc. during a block report.
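
A sketch of that batching idea (illustrative only; the threshold and names are 
invented):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Buffer report entries per stripe and take each stripe's lock once per
// ~1000 blocks rather than once per block.
public class BatchedReportProcessor {
  private static final int FLUSH_THRESHOLD = 1000;
  private final int nStripes;
  private final ReentrantLock[] stripeLocks;
  private final List<List<Long>> buffers = new ArrayList<>();

  public BatchedReportProcessor(int nStripes) {
    this.nStripes = nStripes;
    this.stripeLocks = new ReentrantLock[nStripes];
    for (int i = 0; i < nStripes; i++) {
      stripeLocks[i] = new ReentrantLock();
      buffers.add(new ArrayList<Long>());
    }
  }

  public void add(long blockId) {
    int stripe = (int) Math.floorMod(blockId, (long) nStripes);
    List<Long> buf = buffers.get(stripe);
    buf.add(blockId);
    if (buf.size() >= FLUSH_THRESHOLD) {
      flushStripe(stripe);
    }
  }

  private void flushStripe(int stripe) {
    List<Long> buf = buffers.get(stripe);
    stripeLocks[stripe].lock();   // one acquisition for the whole batch
    try {
      for (long blockId : buf) {
        // apply this block's report entry to the stripe's shard here
      }
    } finally {
      stripeLocks[stripe].unlock();
    }
    buf.clear();
  }
}
{code}

Tuning FLUSH_THRESHOLD is exactly the latency/throughput tradeoff described 
above.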

bq. IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with 
a hypothetical 10TB disk and a low average block size of 10MB, we get 1M 
blocks/disk in the foreseeable future. With splitting that gets you to a 
reasonable ~7MB block report per disk. I am not saying no to 
chunking/compression but it would be useful to see some perf comparison before 
we add that complexity.

1 M blocks per disk, on 10 disks, and 24 bytes per block, is a 240 MB block 
report (did I do that math right?)  That's definitely bigger than we'd like the 
full BR RPC to be, and compression can help here.  Or possibly separating the 
block report into multiple RPCs.  Perhaps one RPC per storage?
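
For what it's worth, the arithmetic checks out under a fixed 
24-bytes-per-block assumption (protobuf vint encoding makes the real wire size 
smaller, per the discussion above):

{code:java}
public class ReportSizeMath {
  public static void main(String[] args) {
    long blocksPerDisk = 1_000_000L;
    long disks = 10;
    long bytesPerBlock = 24;   // naive fixed-width assumption
    // 240,000,000 bytes, i.e. roughly 240 MB for one full report
    System.out.println(blocksPerDisk * disks * bytesPerBlock);
  }
}
{code}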

Incidentally, there's nothing magic about the 64 MB RPC size.  I originally 
added that number as the max RPC size when I did HADOOP-9676, just by choosing 
a large power of 2 that seemed way too big for a real, non-corrupt RPC message. 
:)  But big RPCs are bad in general because of the way Hadoop IPC works.  To 
improve our latency, we probably want a block report RPC size much smaller than 
64 MB.



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-25 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337977#comment-14337977
 ] 

Arpit Agarwal commented on HDFS-7836:
-

Hi Charles, thank you for posting the design doc. This is a very interesting 
proposal. A few comments:
bq. Full block reports are problematic. The more blocks each DataNode has, the 
longer it takes to process a full block report from that DataNode.
bq. Since the protobufs code encodes and decodes from Java arrays, this 
necessitates the allocation of a very large java array, which may cause garbage 
collection issues. By splitting block reports into multiple arrays, we can 
avoid these issues.
bq. It will eliminate the long pauses that we have now as a consequence of full 
block reports coming in
bq. We can compress block reports via lz4 or gzip.
The DataNode can now split block reports per storage directory (post 
HDFS-2832), controlled by {{DFS_BLOCKREPORT_SPLIT_THRESHOLD_KEY}}. Did you get 
a chance to try it out and see if it helps? Splitting reports addresses all of 
the above.
bq. We have found that large Java heap sizes lead to very long startup times 
for the NameNode. This is because multiple full GCs often occur during startup 
... Reducing the size of the Java heap will improve the NameNode startup time.
Do you have any estimates for startup time overhead due to GCs?
bq. The Java heap size will be reduced, because the BlockManager data will be 
off-heap. We have found that BlockManager data makes up about 50% of a typical 
NameNode heap.
Thanks for mentioning this in the doc, it's a significant point.
bq. We will shard the BlocksMap into several “lock stripes.” The number of lock 
stripes should be matched to the number of CPUs available and the degree of 
concurrency in the workload. 
How does this affect block report processing? We cannot assume DataNodes will 
sort blocks by target stripe. Will the NameNode sort received reports or will 
it acquire+release a lock per block? If the former, then there should probably 
be some randomization of order across threads to avoid unintended serialization 
e.g. lock convoys.
bq. Block IDs would be sharded into stripes based on a simple algorithm -- 
possibly modulus.
The scheme should probably not be that simple. Block IDs are sequentially 
allocated so it is not hard to think of pathological app behavior causing 
skewed block distribution across stripes over time.
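
One way to harden the stripe choice against sequential-ID patterns would be to 
run the block ID through a strong mixing step before taking the modulus. The 
sketch below uses the 64-bit finalizer from MurmurHash3 (illustrative; the 
design doc does not specify this):

{code:java}
public final class StripeHash {
  private StripeHash() {}

  // MurmurHash3's 64-bit finalizer: cheap, and scatters sequential IDs.
  static long mix64(long h) {
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    h *= 0xc4ceb9fe1a85ec53L;
    h ^= h >>> 33;
    return h;
  }

  static int stripeOf(long blockId, int nStripes) {
    return (int) Math.floorMod(mix64(blockId), (long) nStripes);
  }
}
{code}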
bq. We will use the “flyweight pattern” to permit access to the off-heap memory 
by wrapping it with a short-lived Java object containing appropriate accessors 
and mutators.
It will be interesting to measure the impact on GC and CPU, if any.
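
A minimal sketch of that flyweight pattern (the field layout and names below 
are invented): one reusable wrapper is re-pointed at successive off-heap 
records, so reads allocate nothing long-lived:

{code:java}
import sun.misc.Unsafe;

public class BlockInfoFlyweight {
  private static final int BLOCK_ID_OFFSET = 0;   // hypothetical layout
  private static final int GEN_STAMP_OFFSET = 8;

  private final Unsafe unsafe;
  private long address;   // off-heap record this view currently wraps

  public BlockInfoFlyweight(Unsafe unsafe) { this.unsafe = unsafe; }

  public BlockInfoFlyweight wrap(long recordAddress) {
    this.address = recordAddress;   // re-point, no allocation
    return this;
  }

  public long getBlockId()  { return unsafe.getLong(address + BLOCK_ID_OFFSET); }
  public long getGenStamp() { return unsafe.getLong(address + GEN_STAMP_OFFSET); }
  public void setGenStamp(long gs) { unsafe.putLong(address + GEN_STAMP_OFFSET, gs); }
}
{code}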



[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements

2015-02-24 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335747#comment-14335747
 ] 

Charles Lamb commented on HDFS-7836:


Problem Statement

The number of blocks stored by the largest HDFS clusters continues to increase. 
 This increase adds pressure to the BlockManager, that part of the NameNode 
which handles block data from across the cluster.

Full block reports are problematic.  The more blocks each DataNode has, the 
longer it takes to process a full block report from that DataNode.  Storage 
densities have roughly doubled each year for the past few years.  Meanwhile, 
increases in CPU power have come mostly in the form of additional cores rather 
than faster clock speeds.  Currently, the NameNode cannot use these additional 
cores because full block reports are processed while holding the namesystem 
lock.

The BlockManager stores all blocks in memory and this contributes to a large 
heap size.  As the NameNode Java heap size has grown, full garbage collection 
events have started to take several minutes.  Although it is often possible to 
avoid full GCs by re-using Java objects, they remain an operational concern for 
administrators.  They also contribute to a long NameNode startup time, 
sometimes measured in tens of minutes for the biggest clusters.


Goals
We need to improve the BlockManager to handle the challenges of the next few 
years.  Our specific goals for this project are to:

* Reduce lock contention for the FSNamesystem lock
* Enable concurrent processing of block reports
* Reduce the Java heap size of the NameNode
* Optimize the use of network resources

[~cmccabe] and I will be working on this Jira. We propose doing this work on a 
separate branch. If there is interest in a community meeting to discuss these 
changes, then perhaps Tuesday 3/10/15 at Cloudera in Palo Alto, CA would work? 
I suggest that date because I will be in the bay area that day and would like 
to meet with other interested community members in person. I'll also be around 
3/11 and 3/12 if we need an alternate date.

