sanastas commented on issue #5698: Oak: New Concurrent Key-Value Map
URL: https://github.com/apache/incubator-druid/issues/5698#issuecomment-504917683

@jihoonson thanks for taking a look!

First, this proposal is indeed about supporting OakIncrementalIndex. This is an idea we have been pursuing for a while. In general, it is about building bigger off-heap IncrementalIndexes and enjoying good performance :) Would you like to work with us on promoting this possibility? Even just as a consultant, your insights would be very valuable.

Second, Oak has a couple of advantages when working with big data. As @ebortnik mentioned, working with off-heap serialized data makes it less affected by JVM GC. In addition, Oak exploits cache locality for searches. Lastly, Oak performs well under multi-threaded contention and scales well with multiple threads.

However, in this specific experiment (single thread) the main problem is most likely caused by GC. Druid's original IncrementalIndex allocates the (to-be-added) rows on-heap prior to the benchmarks, taking 4GB of the 12GB given. Then StringIndexer takes more memory to store the String<->Integer translation; let's exaggerate and give it another 4GB. That leaves 4GB for all other on-heap objects, which puts a lot of stress on GC. The ConcurrentSkipListMap used in Druid's IncrementalIndex is known to be less GC-friendly because of the many small objects it allocates. I believe this is the reason for the huge performance degradation we see.

For example, here is a [reference](https://docs.oracle.com/cd/E19159-01/819-3681/6n5srlhqf/index.html) from Oracle themselves describing the Java garbage collector:

> Garbage collection (GC) reclaims the heap space previously allocated to objects no longer needed. The process of locating and removing the dead objects can stall any application and consume as much as 25 percent throughput.
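To make the memory argument above concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: the class `OffHeapSketch` and all of its methods are hypothetical, the record layout is invented for the example, and none of this is Oak's or Druid's actual API. It shows the rough heap budget from the figures above, and contrasts an on-heap `ConcurrentSkipListMap` (many small objects per entry) with the same data serialized into one direct `ByteBuffer` that lives outside the Java heap.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration only; not Oak's or Druid's API.
class OffHeapSketch {

    // Heap left for everything else, using the rough figures from the comment.
    static long remainingHeapGb(long totalGb, long rowsGb, long dictionaryGb) {
        return totalGb - rowsGb - dictionaryGb;
    }

    // On-heap: every entry brings a skip-list node, a boxed key and a value
    // object, all of which the garbage collector has to trace.
    static ConcurrentSkipListMap<Integer, String> buildOnHeap(int n) {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        for (int i = 0; i < n; i++) {
            map.put(i, "row-" + i);
        }
        return map;
    }

    // Off-heap: entries are serialized as [key:int][length:int][bytes] into a
    // single direct buffer. 24 bytes per record is enough for "row-" + any int.
    static ByteBuffer buildOffHeap(int n) {
        ByteBuffer buf = ByteBuffer.allocateDirect(n * 24);
        for (int i = 0; i < n; i++) {
            byte[] value = ("row-" + i).getBytes(StandardCharsets.UTF_8);
            buf.putInt(i).putInt(value.length).put(value);
        }
        buf.flip(); // switch from writing to reading
        return buf;
    }

    // Linear scan over the serialized records; a real index would search smarter.
    static String readAt(ByteBuffer data, int key) {
        ByteBuffer buf = data.duplicate(); // leave the caller's position untouched
        while (buf.remaining() >= 8) {
            int k = buf.getInt();
            int len = buf.getInt();
            byte[] value = new byte[len];
            buf.get(value);
            if (k == key) {
                return new String(value, StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("Heap left (GB): " + remainingHeapGb(12, 4, 4));
        System.out.println("On-heap entries: " + buildOnHeap(1000).size());
        ByteBuffer offHeap = buildOffHeap(1000);
        System.out.println("Direct buffer: " + offHeap.isDirect());
        System.out.println("Key 7 -> " + readAt(offHeap, 7));
    }
}
```

The point of the contrast is only that the direct buffer is a single allocation whose contents the collector does not walk, whereas the skip-list map grows the number of live heap objects with every row.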

And [here](http://gridgain.blogspot.com/2014/06/jdk-g1-garbage-collector-pauses-for.html) one can take a look at experiments with GC and large heap sizes. The conclusion is:

> From conducting numerous tests, we have concluded that unless you are utilizing some off-heap technology, no Garbage Collector provided with JDK will render any kind of stable GC performance with heap sizes larger that 16GB. For example, on 50GB heaps we can often encounter up to 5 minute GC pauses, with average pauses of 2 to 4 seconds.

I would be really glad to hear your thoughts!
