[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522480#comment-14522480 ]

Colin Patrick McCabe commented on HDFS-7836:

Hi [~xinwei],

The discussion on March 11th focused on our February 24th proposal for off-heaping and parallelizing the block manager. We spent a lot of time going through the proposal and responding to questions about it. There was widespread agreement that we needed to reduce the garbage collection impact of the millions of BlockInfoContiguous structures. There was some disagreement about how to do that. Daryn argued that using large primitive arrays was the best way to go. Charles and I argued that using off-heap storage was better.

The main advantage of large primitive arrays is that the existing Java -Xmx memory settings keep working as expected. The main advantage of off-heap is that it allows the use of things like {{Unsafe#compareAndSwap}}, which can often lead to more efficient concurrent data structures. Also, when using off-heap memory, we get to reuse malloc rather than essentially writing our own malloc for every subsystem. There was some hand-wringing about off-heap memory being slower, but I do not believe that concern is valid. Apache Spark found that their off-heap hash table was actually faster than the on-heap one, due to the ability to better control the memory layout: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html The key is to avoid {{DirectByteBuffer}}, which is rather slow, and use {{Unsafe}} instead, as sketched below.

However, Daryn has posted some patches using the large arrays approach. Since they are a nice incremental improvement, we will probably pick them up if there are no blockers. We are also looking at incremental improvements such as implementing backpressure for full block reports, and speeding up edit log replay (if possible). I would also like to look at parallelizing full block report processing... if we can do that, we can get a dramatic improvement in FBR times by using more than one core.
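A minimal sketch of the {{Unsafe}} point above (illustrative only, not code from any patch here): {{compareAndSwapLong}} can operate directly on a malloc'd off-heap address, which {{DirectByteBuffer}} does not expose.

{code:java}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapCasDemo {
  private static Unsafe unsafe() throws Exception {
    // The usual reflection hack; Unsafe is a private Sun API.
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    return (Unsafe) f.get(null);
  }

  public static void main(String[] args) throws Exception {
    Unsafe u = unsafe();
    long addr = u.allocateMemory(8);   // raw malloc, outside the Java heap
    u.putLong(addr, 0L);
    // Atomic compare-and-swap directly on the off-heap word; the null base
    // object tells the JVM that 'addr' is an absolute address.
    boolean swapped = u.compareAndSwapLong(null, addr, 0L, 42L);
    System.out.println("swapped=" + swapped + " value=" + u.getLong(addr));
    u.freeMemory(addr);
  }
}
{code}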
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518976#comment-14518976 ]

Xinwei Qin commented on HDFS-7836:

Hi [~cmccabe], [~clamb], this is a very meaningful improvement. Is there any update or plan for next steps on this JIRA? Could you post a summary of the meeting held on March 11th?
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353386#comment-14353386 ]

Charles Lamb commented on HDFS-7836:

JOIN WEBEX MEETING
https://cloudera.webex.com/join/clamb | 622 867 972

JOIN BY PHONE
1-650-479-3208 Call-in toll number (US/Canada)
Access code: 622 867 972
Global call-in numbers: https://cloudera.webex.com/cloudera/globalcallin.php?serviceType=MC&ED=342142257&tollFree=0

Can't join the meeting? Contact support here: https://cloudera.webex.com/mc

IMPORTANT NOTICE: Please note that this WebEx service allows audio and other information sent during the session to be recorded, which may be discoverable in a legal matter. By joining this session, you automatically consent to such recordings. If you do not consent to being recorded, discuss your concerns with the host or do not join the session.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353455#comment-14353455 ]

Charles Lamb commented on HDFS-7836:

If you are planning on attending the meeting in person, please drop me an email so I have an idea of how large a conference room to book. Thanks.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350878#comment-14350878 ]

Colin Patrick McCabe commented on HDFS-7836:

I don't think the block map size is really that easy to get at right now. Taking a heap dump on a big NN can take minutes... it's not something most sysadmins will let you do. And the analysis is difficult... a lot of common heap analysis tools require tons of memory. Anyway, we should probably add a JMX counter for the size(s) of the block map hash tables, and the number of entries, for tracking purposes. A sketch of what that could look like follows.
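To make that concrete, a hypothetical sketch of such a counter using the standard platform MBean server (the bean and attribute names here are invented for illustration; HDFS would presumably wire this through its own metrics system instead):

{code:java}
import java.lang.management.ManagementFactory;
import javax.management.MXBean;
import javax.management.ObjectName;

public class BlockMapStats implements BlockMapStats.StatsMXBean {

  @MXBean
  public interface StatsMXBean {
    long getHashTableBytes();
    long getEntryCount();
  }

  private volatile long tableBytes;
  private volatile long entryCount;

  public long getHashTableBytes() { return tableBytes; }
  public long getEntryCount() { return entryCount; }

  void update(long bytes, long entries) { tableBytes = bytes; entryCount = entries; }

  public static void main(String[] args) throws Exception {
    BlockMapStats stats = new BlockMapStats();
    // Register under a Hadoop-style domain so it shows up next to other NN beans.
    ObjectName name = new ObjectName("Hadoop:service=NameNode,name=BlockMapStats");
    ManagementFactory.getPlatformMBeanServer().registerMBean(stats, name);
    stats.update(64L << 20, 9_000_000L);  // example values
    Thread.sleep(Long.MAX_VALUE);         // keep alive for jconsole inspection
  }
}
{code}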
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349438#comment-14349438 ]

Charles Lamb commented on HDFS-7836:

We'll hold a design review meeting and discussion of this project next Weds, March 11th, 10am to 1pm (PDT) at the Cloudera offices in Palo Alto. I'll post WebEx information on this Jira before then. If you plan on attending in person, please send me a private email so I know how many people to expect.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349599#comment-14349599 ]

Chris Nauroth commented on HDFS-7836:

The thing we're losing is the ability to measure memory consumption by the block map specifically. The OS and standard JMX metrics are great for measuring the whole process in aggregate, but it would be nice to add another counter to compensate for the measurement we'd be losing.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347999#comment-14347999 ]

Colin Patrick McCabe commented on HDFS-7836:

I think it makes sense to have the total heap size in a JMX counter, if it's not already there somewhere. It's a pretty easy number to get from the OS, although there are confounding factors like shared libraries and shared memory segments. But in general, those should be minor contributors.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348237#comment-14348237 ]

stack commented on HDFS-7836:

The JVM by default publishes off-heap usage via JMX, in the {{NonHeapMemoryUsage}} attribute of the {{java.lang:type=Memory}} bean. It has committed/max/used. If that is insufficient, we could figure out how to expose more, for sure.
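For reference, a small sketch that reads that bean from inside the process using the standard {{java.lang.management}} API. One caveat worth hedging on: memory obtained directly via {{Unsafe#allocateMemory}} is plain malloc and, to my knowledge, is not reflected in {{NonHeapMemoryUsage}} (which tracks JVM-managed pools like the code cache and permgen/metaspace), so a dedicated counter may still be needed.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryUsageDump {
  public static void main(String[] args) {
    MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
    // Same data the JMX bean exposes as the NonHeapMemoryUsage attribute.
    MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
    System.out.printf("non-heap: used=%d committed=%d max=%d%n",
        nonHeap.getUsed(), nonHeap.getCommitted(), nonHeap.getMax());
  }
}
{code}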
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345012#comment-14345012 ]

Charles Lamb commented on HDFS-7836:

Yes, there would definitely be a WebEx available.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344735#comment-14344735 ]

Yi Liu commented on HDFS-7836:

Good proposal, Colin and Charles.

{quote}
Reduce lock contention for the FSNamesystem lock
Enable concurrent processing of block reports
{quote}

Maybe in the first phase, we can separate the locks for block reports from the FSN lock. Even that alone would be a good improvement, and it includes a bulk of the work. Then we can do further improvement to have multiple locks (stripes) for block reports.

{quote}
Use offheap to reduce the Java heap size of the NameNode
{quote}

Generally we have a few ways to use offheap: 1) DirectByteBuffer, which does not fit our case; 2) Sun Unsafe; 3) writing native code and allocating memory ourselves. It seems we are going to use #3?

{quote}
We probably want to support dynamically growing the hash table, to avoid putting too much of a burden on the administrator to configure the size
{quote}

If so, we still need an initial capacity for the hash table; otherwise there will be lots of rehashing when the load factor reaches the threshold. How about we do the same thing we currently do in the Java blocksMap and use 2% of total memory? Then we don't need the load factor and can assume the number of table rows is enough.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345482#comment-14345482 ]

Colin Patrick McCabe commented on HDFS-7836:

bq. Generally we have a few ways to use offheap: 1) DirectByteBuffer, which does not fit our case; 2) Sun Unsafe; 3) writing native code and allocating memory ourselves. It seems we are going to use #3?

I think #2 is the best. As we talked about earlier, writing more JNI (#3) just adds more platform dependencies that we don't want. With regard to #1, allocating DirectByteBuffers is very slow; see the micro-benchmark sketch below.

bq. If so, we still need an initial capacity for the hash table; otherwise there will be lots of rehashing when the load factor reaches the threshold. How about we do the same thing we currently do in the Java blocksMap and use 2% of total memory? Then we don't need the load factor and can assume the number of table rows is enough.

That's an interesting idea. Keep in mind, though, that this hash table will be off-heap, so sizing it based on the JVM heap size would be weird. Honestly, I think having a setting for this is easiest. I also suspect that growing the off-heap hash table will be quicker than growing an on-heap hash table, since it won't trigger full GCs during resizing.
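To make the {{DirectByteBuffer}} allocation-cost claim checkable, here is a rough micro-benchmark sketch (illustrative only; the numbers depend on JVM version and flags, so treat it as a way to measure rather than a result):

{code:java}
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class AllocBench {
  public static void main(String[] args) throws Exception {
    final int N = 10_000, SIZE = 4096;

    long t0 = System.nanoTime();
    for (int i = 0; i < N; i++) {
      // allocateDirect zeroes the memory and is accounted against the
      // JVM's direct-memory limit, coupling it to the garbage collector.
      ByteBuffer.allocateDirect(SIZE);
    }
    long direct = System.nanoTime() - t0;

    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe u = (Unsafe) f.get(null);

    t0 = System.nanoTime();
    for (int i = 0; i < N; i++) {
      long addr = u.allocateMemory(SIZE);  // raw malloc: no zeroing, no GC coupling
      u.freeMemory(addr);
    }
    long raw = System.nanoTime() - t0;

    System.out.printf("allocateDirect=%dms unsafe-malloc=%dms%n",
        direct / 1_000_000, raw / 1_000_000);
  }
}
{code}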
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345475#comment-14345475 ]

Colin Patrick McCabe commented on HDFS-7836:

Thanks, Yi. We'll definitely provide a WebEx.

I think moving the BRs out from under the FSN lock is more work than it seems, and difficult to do incrementally. There are a lot of cases where we access blocks in FSNamesystem, assuming that the FSN lock is sufficient protection (or, to be pedantically correct, the FSDirectory lock in really recent versions). Sure, we could have a separate BlockManager lock and take that, but then we'd have to wait for that lock while holding the FSN lock, which would be disastrous for performance. The locking change is going to have to be done as part of the sharding and other refactors in order to be beneficial.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345610#comment-14345610 ]

Chris Nauroth commented on HDFS-7836:

I have one additional operability concern. With off-heaping, we're going to lose the ability to measure the size of the block map with well-known tools like {{jmap}}. Do you have any thoughts on compensating for this? Maybe we could publish new JMX counters with the off-heap size stats?
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344742#comment-14344742 ]

Yi Liu commented on HDFS-7836:

{quote}
Charles Lamb is going to be around on 3/10, 3/11, and 3/12... does any of those days work for you for having a meetup? Perhaps we can combine our efforts, if that turns out to make sense.
{quote}

Could you schedule a WebEx or Lync too? Then people like me can join remotely if we get time :)
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343695#comment-14343695 ]

Colin Patrick McCabe commented on HDFS-7836:

bq. Arpit said: We do use one RPC per storage when the block count is over 1M, viz. DFS_BLOCKREPORT_SPLIT_THRESHOLD_DEFAULT. The math doesn't work since protobuf uses vint on the wire. 9M blocks ~ 64MB was seen empirically in a couple of different deployments. It was used as the basis for the default of 1M.

Ah, thanks for that. You are right that my math was off, because I was not considering how vints are encoded. I was not aware that we currently have the ability to split full block reports (FBRs) into multiple RPCs. I knew that we were releasing the lock after processing each storage, but I didn't realize the ability existed to do separate RPCs as well. I'll have to look into that. It certainly will help a lot when it comes to reducing RPC size. One interesting question is, how much testing has the split code received? I think DNs with over 1 million blocks are still rare.

bq. Arpit wrote: With sequential allocation, a job that does 'create N files, delete M files, repeat' could cause that imbalance over time.

Maybe I am missing something, but I don't see a realistic scenario where this could happen. MR jobs are pretty much all multi-threaded (unless you are running a single-node test cluster) and block allocations are not going to be orderly... the random noise in the system, like random RPC delays, locking delays, and so forth, will ensure that we don't see a particular mapper or reducer get all blocks mod 5 (or any other simple pattern like that). Of course, the cost of a more sophisticated hash function might be low enough that we should just do it anyway; see the sketch below.

bq. Daryn said: Colin, please take a look and provide feedback on the doc I linked on HDFS-6658. It's been a multi-month effort to find a performant implementation. It's smaller in scope than the design here, but very similar in a number of ways.

In general, I didn't realize that HDFS-6658 was still being actively worked on, since it was dormant (no comments) for 5 months until last Friday or so. I did review it earlier, but I had some concerns about running out of datanode indices. I think those could be addressed (and might have been in the latest version), but only at the cost of adding a garbage-collection-style system. Also, as you mention, the scope is a lot smaller than this one's -- HDFS-6658 is a much more short-term solution.

[~clamb] is going to be around on 3/10, 3/11, and 3/12... does any of those days work for you for having a meetup? Perhaps we can combine our efforts, if that turns out to make sense.
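A sketch of what "a more sophisticated hash function" could look like (hypothetical; this uses the SplitMix64 finalizer as the mixing step, which is cheap and would break up any arithmetic pattern in sequential block IDs before the modulus is taken):

{code:java}
public final class StripeHash {
  // SplitMix64 finalizer: a few multiplies and shifts, well-mixed output.
  static long mix64(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
  }

  static int stripeFor(long blockId, int nStripes) {
    return (int) Long.remainderUnsigned(mix64(blockId), nStripes);
  }

  public static void main(String[] args) {
    int[] counts = new int[7];
    for (long id = 0; id < 1_000_000; id++) {
      counts[stripeFor(id, 7)]++;   // sequential IDs still spread evenly
    }
    System.out.println(java.util.Arrays.toString(counts));
  }
}
{code}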
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340474#comment-14340474 ]

Daryn Sharp commented on HDFS-7836:

Colin, please take a look and provide feedback on the doc I linked on HDFS-6658. It's been a multi-month effort to find a performant implementation. It's smaller in scope than the design here, but very similar in a number of ways.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341074#comment-14341074 ]

Arpit Agarwal commented on HDFS-7836:

bq. 1M blocks per disk, on 10 disks, at 24 bytes per block, is a 240 MB block report (did I do that math right?). That's definitely bigger than we'd like the full BR RPC to be, and compression can help here. Or possibly separating the block report into multiple RPCs. Perhaps one RPC per storage?

We do use one RPC per storage when the block count is over 1M, viz. {{DFS_BLOCKREPORT_SPLIT_THRESHOLD_DEFAULT}}. The math doesn't work since protobuf uses vint on the wire. _9M blocks ~ 64MB_ was seen empirically in a couple of different deployments. It was used as the basis for the default of 1M.

bq. Hmm. Our sequential block allocations should guarantee that mod N produces an approximately equal number of blocks in each stripe. It is only with randomly allocated block IDs that we could even theoretically get an imbalance (although the probability is vanishingly small even there if the randomness is uniform). With sequentially allocated block IDs the stripes will always be of equal size. I guess deletions of blocks could change that, but I see no reason why any group of blocks mod N should be more deleted than another group.

With sequential allocation, a job that does 'create N files, delete M files, repeat' could cause that imbalance over time.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338394#comment-14338394 ]

Charles Lamb commented on HDFS-7836:

Hi [~arpit99],

Thanks for reading over the design doc and commenting on it.

bq. The DataNode can now split block reports per storage directory (post HDFS-2832), controlled by DFS_BLOCKREPORT_SPLIT_THRESHOLD_KEY. Did you get a chance to try it out and see if it helps? Splitting reports addresses all of the above. (edit: it does not address the network bandwidth gains from compression, though)

I think you mean your work on HDFS-5153, right? If I understand it correctly, it sends one report per storage. We have seen block reports of 100MB+, so we suspect that an even smaller chunk size than a storage may yield benefits. That said, I am also watching [~daryn]'s work on HDFS-7435, which addresses a lot of this piece of this Jira's proposal. I think that once HDFS-7435 is committed, we will make some measurements and see whether anything else in the area of chunking is necessary. As you point out, compression should also help.

bq. Do you have any estimates for startup time overhead due to GCs? We know of at least one large deployment which experiences a full GC pause during startup.

I'm not sure of the time, but in general, the off-heaping will help NN throughput just by reducing the number of objects on the heap.

bq. How does this affect block report processing? We cannot assume DataNodes will sort blocks by target stripe. Will the NameNode sort received reports or will it acquire+release a lock per block? If the former, then there should probably be some randomization of order across threads to avoid unintended serialization, e.g. lock convoys.

The idea is that currently, processing a block report requires taking the FSN lock. So this proposal has two parts. First, use better locking semantics so that we don't have to take the FSN lock. Second, shard the blocksMap structure so that multiple threads can operate on it concurrently. Even if we continue to process BRs under one big happy FSN lock, having multiple threads operate concurrently will yield benefits.

The sharding (striping) is along arbitrary boundaries. For instance, the design doc suggests striping by blockId % nStripes. nStripes would be configurable to a relatively small number (the design doc suggests 4 to 16), and if the modulo calculation is used, nStripes would be a prime roughly equal to the number of threads available. As long as block report processing per block does not need to access more than one shard at a time, this will be fine -- multiple threads can process blocks in parallel. It is a technique that Berkeley DB Java Edition uses for its lock table to improve concurrency. A sketch of the idea follows.
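To illustrate the striping described above (a minimal sketch with hypothetical names, not the actual branch code): the map is split into independently locked segments selected by block ID, so threads working on different stripes never contend.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class StripedBlocksMap<V> {
  private final int nStripes;
  private final ReentrantLock[] locks;
  private final Map<Long, V>[] maps;

  @SuppressWarnings("unchecked")
  public StripedBlocksMap(int nStripes) {
    this.nStripes = nStripes;
    this.locks = new ReentrantLock[nStripes];
    this.maps = new Map[nStripes];
    for (int i = 0; i < nStripes; i++) {
      locks[i] = new ReentrantLock();
      maps[i] = new HashMap<Long, V>();
    }
  }

  private int stripe(long blockId) {
    // Double mod keeps the result non-negative even for negative IDs.
    return (int) ((blockId % nStripes + nStripes) % nStripes);
  }

  public void put(long blockId, V info) {
    int s = stripe(blockId);
    locks[s].lock();
    try {
      maps[s].put(blockId, info);
    } finally {
      locks[s].unlock();
    }
  }

  public V get(long blockId) {
    int s = stripe(blockId);
    locks[s].lock();
    try {
      return maps[s].get(blockId);
    } finally {
      locks[s].unlock();
    }
  }
}
{code}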
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339007#comment-14339007 ]

Chris Nauroth commented on HDFS-7836:

bq. With regard to compatibility... even if the NN max heap size is unchanged, the NN will not actually consume that memory until it needs it, right? So existing configurations should work just fine. I guess the one exception might be if you set a super-huge minimum (not maximum) JVM heap size.

Yes, that's a correct description of the behavior, but I typically deploy with {{-Xms}} and {{-Xmx}} configured to the same value. In my experience, this has been a more stable configuration for long-running Java processes with a large heap, like the NameNode. It can prevent unpredictable, costly allocations later in the process lifetime, when they aren't expected. Also, there is no guarantee that there would be sufficient memory to satisfy such an allocation later in the process lifetime, which would result in the process terminating. You could also run afoul of the OOM killer after weeks of a seemingly stable run. The JVM never frees the memory it allocates when the heap grows, so you end up needing to plan for the worst case anyway. I prefer the fail-fast behavior of trying to take all the memory at JVM process startup. Of course, this circumvents any opportunity to run with a smaller memory footprint than the max, but I find that's generally the right trade-off for servers where it's typical to run only a very well-defined set of processes on the box and plan strict resource allocations for each one.

bq. There are a few big users who can shave several minutes off of their NN startup time by setting the NN minimum heap size to something large. This prevents successive rounds of stop-the-world GC where we copy everything to a new, 2x-as-large heap.

That's another example of why setting {{-Xms}} equal to {{-Xmx}} can be helpful. These are the kinds of configurations where I have a potential compatibility concern. If someone is running multiple daemons on the box, and they have carefully carved out exact heap allocations by setting {{-Xms}} equal to {{-Xmx}} on each process, then moving data off-heap leaves their eager large heap allocation unused and potentially exceeds total available RAM.

bq. As a practical matter, we have found that everyone who has a big NN heap has a big NN machine, which has much more memory than we can even use currently. So I would not expect this to be a problem in practice.

I'll dig into this more on my side too and try to get more details on whether or not this really could be a likely problem in practice. I'd be curious to hear from others in the community too. (There are a lot of ways to run operations for a Hadoop cluster, sometimes with differing opinions on configuration best practices.)
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338901#comment-14338901 ]

Charles Lamb commented on HDFS-7836:

bq. Are you proposing that off-heaping is an opt-in feature that must be explicitly enabled in configuration, or are you proposing that off-heaping will be the new default behavior? Arguably, jumping to off-heaping as the default could be seen as a backwards incompatibility, because it might be unsafe to deploy the feature without simultaneously down-tuning the NameNode max heap size. Some might see that as backwards-incompatible with existing configurations.

The proposal is to have an option that lets the offheap code allocate slabs using 'new byte[]' rather than malloc. This would be used for debugging purposes, not in a normal deployment.
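A hedged sketch of how such a debug option might look (all names hypothetical): one slab interface, with the {{new byte[]}} implementation shown here and the production variant backed by {{Unsafe#allocateMemory}} behind the same interface.

{code:java}
import java.nio.ByteBuffer;

interface Slab {
  long getLong(int off);
  void putLong(int off, long val);
}

// Debug mode: an ordinary on-heap byte[], so standard heap tools like jmap
// can still see the data.
class HeapSlab implements Slab {
  private final ByteBuffer buf;

  HeapSlab(int size) { this.buf = ByteBuffer.wrap(new byte[size]); }

  public long getLong(int off) { return buf.getLong(off); }
  public void putLong(int off, long val) { buf.putLong(off, val); }
}

// The production variant would implement the same interface with
// Unsafe.allocateMemory/getLong/putLong, as in the CAS sketch earlier
// in this thread.
{code}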
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338926#comment-14338926 ]

Colin Patrick McCabe commented on HDFS-7836:

Thanks for taking a look at this, [~cnauroth] and [~arpitagarwal].

bq. Chris wrote: Do you intend to enforce an upper limit on growth of the off-heap allocation? If so, do you see this as a new configuration property or as a function of an existing parameter (i.e. equal to max heap, with the consideration that the block map takes ~50% of the heap now)? -Xmx alone will no longer be sufficient to define a ceiling for RAM utilization by the NameNode process. This can be important in deployments that choose to co-locate other Hadoop daemons on the same hosts as the NameNode.

That's an interesting question. It is certainly possible to put an upper limit on the growth of the total process heap (Java + non-Java) by using {{ulimit -m}}, on any UNIX-like OS. I'm not sure that I would recommend this configuration, since the behavior when the ulimit is exceeded is that the process is terminated. I think it would be better in most cases to have the management software running on the cluster examine heap sizes, and warn when memory is getting low.

bq. Can I take this to mean that there will be no new native code written as part of this project? Of course, we can always do a native code implementation later if use of a private Sun API becomes problematic, but I wanted to understand the code footprint for the current proposal. Avoiding native code entirely would be nice, because it reduces the scope of testing efforts across multiple platforms.

Yeah, we are hoping to avoid writing any JNI code here. So far, it looks good on that front.

bq. Are you proposing that off-heaping is an opt-in feature that must be explicitly enabled in configuration, or are you proposing that off-heaping will be the new default behavior? Arguably, jumping to off-heaping as the default could be seen as a backwards incompatibility, because it might be unsafe to deploy the feature without simultaneously down-tuning the NameNode max heap size. Some might see that as backwards-incompatible with existing configurations.

I think off-heaping should be on by default. HDFS gets enough bad press from having short-circuit and other optimizations turned off by default... we should be a little nicer this time :)

With regard to compatibility... even if the NN max heap size is unchanged, the NN will not actually consume that memory until it needs it, right? So existing configurations should work just fine. I guess the one exception might be if you set a super-huge *minimum* (not maximum) JVM heap size. But even in that case, I would expect things to work, since virtual memory would pick up the slack. Unless you have turned off swapping, but that's not recommended.

As a practical matter, we have found that everyone who has a big NN heap has a big NN machine, which has much more memory than we can even use currently. So I would not expect this to be a problem in practice.

bq. Arpit asked: Do you have any estimates for startup time overhead due to GCs?

There are a few big users who can shave several minutes off of their NN startup time by setting the NN minimum heap size to something large. This prevents successive rounds of stop-the-world GC where we copy everything to a new, 2x-as-large heap. We will get some before-and-after startup numbers later that should illustrate this even more clearly.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338865#comment-14338865 ]

Chris Nauroth commented on HDFS-7836:

+1 for the proposal overall. Thanks for the great write-up, Charles and Colin. I have questions on a few details:

bq. The Java heap size will be reduced, because the BlockManager data will be off-heap.

Do you intend to enforce an upper limit on growth of the off-heap allocation? If so, do you see this as a new configuration property or as a function of an existing parameter (i.e. equal to max heap, with the consideration that the block map takes ~50% of the heap now)? {{-Xmx}} alone will no longer be sufficient to define a ceiling for RAM utilization by the NameNode process. This can be important in deployments that choose to co-locate other Hadoop daemons on the same hosts as the NameNode.

bq. malloc is also accessible without the use of JNI via the Unsafe package.

Can I take this to mean that there will be no new native code written as part of this project? Of course, we can always do a native code implementation later if use of a private Sun API becomes problematic, but I wanted to understand the code footprint for the current proposal. Avoiding native code entirely would be nice, because it reduces the scope of testing efforts across multiple platforms.

bq. The offheaping code should have the option to use on-heap memory.

Are you proposing that off-heaping is an opt-in feature that must be explicitly enabled in configuration, or are you proposing that off-heaping will be the new default behavior? Arguably, jumping to off-heaping as the default could be seen as a backwards incompatibility, because it might be unsafe to deploy the feature without simultaneously down-tuning the NameNode max heap size. Some might see that as backwards-incompatible with existing configurations.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339258#comment-14339258 ]

Colin Patrick McCabe commented on HDFS-7836:

Thanks, Chris. I agree that -Xmx == -Xms is often a useful configuration. It would be interesting to get more information about how much change in those settings will be necessary in general with off-heap. We'll have to think about that.

I created a branch for this work, HDFS-7836.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339353#comment-14339353 ]

Arpit Agarwal commented on HDFS-7836:

Thanks for the responses Charles and Colin.

bq. We have seen block reports of 100MB+, so we suspect that an even smaller chunk size than a storage may yield benefits.

IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with a hypothetical 10TB disk and a low average block size of 10MB, we get 1M blocks/disk in the foreseeable future. With splitting, that gets you to a reasonable ~7MB block report per disk. I am not saying no to chunking/compression, but it would be useful to see some perf comparison before we add that complexity. In the past I used CreateEditsLog to generate files and a simple shell script to generate block files on DataNodes to simulate millions of blocks. Not as convenient as junit, but I'll see if I can clean up and post what I used on HDFS-7847.

bq. So this proposal has two parts. First, use better locking semantics so that we don't have to take the FSN lock.

bq. Even if we continue to process BRs under one big happy FSN lock, having multiple threads operate concurrently will yield benefits.

These two sound contradictory. I assume the former is correct and we won't really take the FSN lock. Also, I did not get how you will process one stripe at a time without repeatedly locking and unlocking, since DataNodes wouldn't know about the block-to-stripe mapping to order the reports. I guess I will wait to see the code.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339385#comment-14339385 ]

Charles Lamb commented on HDFS-7836:

bq. it would be useful to see some perf comparison before we add that complexity.

We definitely plan on getting some baseline measurements and sharing them. We want to know the before-and-after effects of any changes. As an aside, I worked on a case where we had to increase the RPC limit to 192MB in order to get block reports handled correctly, so I know this type of deployment is out there.

bq. I'll see if I can clean up and post what I used on HDFS-7847.

That would be much appreciated. I'm starting to look at HDFS-7847 (a subtask of this Jira) and maybe that could come into play somehow.

bq. These two sound contradictory.

Yes, they do, but they aren't meant to be. The first step would be to do concurrent processing under the FSN lock. That would at least get some parallelism. The second step would be to make a more lockless blocksMap which wouldn't require the big FSN lock to be held.

BTW, since you have edit privileges, if you could get rid of my redundant reply to you above, that would be great. My browser suckered me into hitting Add twice. Thanks.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339695#comment-14339695 ]

Colin Patrick McCabe commented on HDFS-7836:

bq. The hashing scheme should probably not be that simple [as mod 5]. Block IDs are sequentially allocated so it is not hard to think of pathological app behavior causing skewed block distribution across stripes over time.

Hmm. Our sequential block allocations should guarantee that mod N produces an approximately equal number of blocks in each stripe. It is only with randomly allocated block IDs that we could even theoretically get an imbalance (although the probability is vanishingly small even there if the randomness is uniform). With sequentially allocated block IDs the stripes will always be of equal size. I guess deletions of blocks could change that, but I see no reason why any group of blocks mod N should be more deleted than another group.
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339693#comment-14339693 ]

Colin Patrick McCabe commented on HDFS-7836:

bq. These two sound contradictory. I assume the former is correct and we won't really take the FSN lock. Also, I did not get how you will process one stripe at a time without repeatedly locking and unlocking, since DataNodes wouldn't know about the block-to-stripe mapping to order the reports. I guess I will wait to see the code.

Let me clarify a bit. It may be necessary to take the FSN lock very briefly once at the start or end of the block report, but in general, we do not want to do the block report under the FSN lock as it is done currently. The main processing of blocks should not happen under the FSN lock.

The idea is to separate the BlocksMap into multiple stripes. Which blocks go into which stripes is determined by blockID; something like blockID mod 5 would work for this. Clearly, the incoming block report will contain work for several stripes. As you mentioned, that necessitates locking and unlocking. But we can do multiple blocks each time we grab a stripe lock. We simply accumulate a bunch of blocks that we know are in each stripe in a per-stripe buffer (maybe we do 1000 blocks at a time between lock releases... maybe a little more). This is something we can tune to trade off between latency and throughput; see the sketch below. This also means that we will have to be able to handle operations on blocks, new blocks being created, etc. during a block report.

bq. IIRC the default 64MB protobuf message limit is hit at 9M blocks. Even with a hypothetical 10TB disk and a low average block size of 10MB, we get 1M blocks/disk in the foreseeable future. With splitting, that gets you to a reasonable ~7MB block report per disk. I am not saying no to chunking/compression, but it would be useful to see some perf comparison before we add that complexity.

1M blocks per disk, on 10 disks, at 24 bytes per block, is a 240 MB block report (did I do that math right?). That's definitely bigger than we'd like the full BR RPC to be, and compression can help here. Or possibly separating the block report into multiple RPCs. Perhaps one RPC per storage?

Incidentally, there's nothing magic about the 64 MB RPC size. I originally added that number as the max RPC size when I did HADOOP-9676, just by choosing a large power of 2 that seemed way too big for a real, non-corrupt RPC message. :) But big RPCs are bad in general because of the way Hadoop IPC works. To improve our latency, we probably want a block report RPC size much smaller than 64 MB.
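A sketch of the batching described above (hypothetical names; BATCH and the stripe count are the tunables mentioned): bucket the incoming report by stripe, then drain each bucket in chunks per lock acquisition, so other threads can interleave between chunks.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class StripedReportProcessor {
  private static final int N_STRIPES = 5;   // the design doc suggests 4 to 16
  private static final int BATCH = 1000;    // blocks applied per lock hold

  private final ReentrantLock[] stripeLocks = new ReentrantLock[N_STRIPES];

  public StripedReportProcessor() {
    for (int i = 0; i < N_STRIPES; i++) stripeLocks[i] = new ReentrantLock();
  }

  public void processReport(long[] reportedBlockIds) {
    // Bucket the report by target stripe first; DataNodes do not sort by stripe.
    List<List<Long>> buckets = new ArrayList<List<Long>>();
    for (int i = 0; i < N_STRIPES; i++) buckets.add(new ArrayList<Long>());
    for (long id : reportedBlockIds) {
      buckets.get((int) Long.remainderUnsigned(id, N_STRIPES)).add(id);
    }
    // Drain each bucket in chunks of BATCH blocks per lock acquisition.
    for (int s = 0; s < N_STRIPES; s++) {
      List<Long> bucket = buckets.get(s);
      for (int from = 0; from < bucket.size(); from += BATCH) {
        int to = Math.min(from + BATCH, bucket.size());
        stripeLocks[s].lock();
        try {
          for (int i = from; i < to; i++) {
            applyBlock(bucket.get(i));   // per-block state transition (stub)
          }
        } finally {
          stripeLocks[s].unlock();
        }
      }
    }
  }

  private void applyBlock(long blockId) {
    // In the real NameNode this would update the replica state for blockId.
  }
}
{code}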
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337977#comment-14337977 ]

Arpit Agarwal commented on HDFS-7836:

Hi Charles, thank you for posting the design doc. This is a very interesting proposal. A few comments:

bq. Full block reports are problematic. The more blocks each DataNode has, the longer it takes to process a full block report from that DataNode.
bq. Since the protobufs code encodes and decodes from Java arrays, this necessitates the allocation of a very large Java array, which may cause garbage collection issues. By splitting block reports into multiple arrays, we can avoid these issues.
bq. It will eliminate the long pauses that we have now as a consequence of full block reports coming in
bq. We can compress block reports via lz4 or gzip.

The DataNode can now split block reports per storage directory (post HDFS-2832), controlled by {{DFS_BLOCKREPORT_SPLIT_THRESHOLD_KEY}}. Did you get a chance to try it out and see if it helps? Splitting reports addresses all of the above. (edit: it does not address the network bandwidth gains from compression, though)

bq. We have found that large Java heap sizes lead to very long startup times for the NameNode. This is because multiple full GCs often occur during startup ... Reducing the size of the Java heap will improve the NameNode startup time.

Do you have any estimates for startup time overhead due to GCs? We know of at least one large deployment which experiences a full GC pause during startup.

bq. The Java heap size will be reduced, because the BlockManager data will be offheap. We have found that BlockManager data makes up about 50% of a typical NameNode heap.

Thanks for mentioning this in the doc; it's a significant point.

bq. We will shard the BlocksMap into several "lock stripes." The number of lock stripes should be matched to the number of CPUs available and the degree of concurrency in the workload.

How does this affect block report processing? We cannot assume DataNodes will sort blocks by target stripe. Will the NameNode sort received reports or will it acquire+release a lock per block? If the former, then there should probably be some randomization of order across threads to avoid unintended serialization, e.g. lock convoys.

bq. Block IDs would be sharded into stripes based on a simple algorithm, possibly modulus.

The hashing scheme should probably not be that simple. Block IDs are sequentially allocated so it is not hard to think of pathological app behavior causing skewed block distribution across stripes over time.

bq. We will use the "flyweight pattern" to permit access to the offheap memory by wrapping it with a short-lived Java object containing appropriate accessors and mutators.

It will be interesting to measure the impact on GC and CPU, if any. (A sketch of the pattern follows, for readers unfamiliar with the term.)
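A hedged sketch of the flyweight idea being quoted (record layout and names invented for illustration): one reusable cursor object is re-pointed at successive off-heap records, so iterating the block map allocates no per-block Java objects.

{code:java}
import sun.misc.Unsafe;

// Assumed record layout, for illustration only: [blockId:8][genStamp:8][numBytes:8]
final class BlockInfoCursor {
  private final Unsafe unsafe;   // obtained via the usual theUnsafe reflection hack
  private long recordAddr;       // absolute off-heap address of the current record

  BlockInfoCursor(Unsafe unsafe) { this.unsafe = unsafe; }

  // Re-point the same object at another record instead of allocating a new one.
  BlockInfoCursor at(long addr) { this.recordAddr = addr; return this; }

  long blockId()  { return unsafe.getLong(recordAddr); }
  long genStamp() { return unsafe.getLong(recordAddr + 8); }
  long numBytes() { return unsafe.getLong(recordAddr + 16); }
  void setNumBytes(long n) { unsafe.putLong(recordAddr + 16, n); }
}
{code}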
[jira] [Commented] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335747#comment-14335747 ]

Charles Lamb commented on HDFS-7836:

Problem Statement

The number of blocks stored by the largest HDFS clusters continues to increase. This increase adds pressure to the BlockManager, the part of the NameNode which handles block data from across the cluster.

Full block reports are problematic. The more blocks each DataNode has, the longer it takes to process a full block report from that DataNode. Storage densities have roughly doubled each year for the past few years. Meanwhile, increases in CPU power have come mostly in the form of additional cores rather than faster clock speeds. Currently, the NameNode cannot use these additional cores because full block reports are processed while holding the namesystem lock.

The BlockManager stores all blocks in memory, and this contributes to a large heap size. As the NameNode Java heap size has grown, full garbage collection events have started to take several minutes. Although it is often possible to avoid full GCs by re-using Java objects, they remain an operational concern for administrators. They also contribute to a long NameNode startup time, sometimes measured in tens of minutes for the biggest clusters.

Goals

We need to improve the BlockManager to handle the challenges of the next few years. Our specific goals for this project are to:
* Reduce lock contention for the FSNamesystem lock
* Enable concurrent processing of block reports
* Reduce the Java heap size of the NameNode
* Optimize the use of network resources

[~cmccabe] and I will be working on this Jira. We propose doing this work on a separate branch.

If there is interest in a community meeting to discuss these changes, then perhaps Tuesday 3/10/15 at Cloudera in Palo Alto, CA would work? I suggest that date because I will be in the bay area that day and would like to meet with other interested community members in person. I'll also be around 3/11 and 3/12 if we need an alternate date.

BlockManager Scalability Improvements
Key: HDFS-7836
URL: https://issues.apache.org/jira/browse/HDFS-7836
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Charles Lamb
Assignee: Charles Lamb
Attachments: BlockManagerScalabilityImprovementsDesign.pdf

Improvements to BlockManager scalability.