liran-funaro opened a new issue #9967:
URL: https://github.com/apache/druid/issues/9967


   ### Motivation
   
   The current incremental-index implementations (on-heap and off-heap) suffer 
from poor memory utilization and sub-optimal performance.
   In some ingestion scenarios, we observed 200% memory overhead and 70% 
runtime overhead that both are attributed to the GC mechanism.
   This is mainly due to the large number of metadata objects created by Java’s 
`ConcurrentSkipList` (CSL).
   
   ### Proposed changes
   
   We implemented an alternative incremental-index (`OakIncrementalIndex`) that 
has two main attributes that are different from the current implementations:
   
   1. It stores both **keys** and **values** off-heap (as opposed to the 
off-heap implementation that stores only the **values** off-heap).
   2. It is based on [`OakMap`](https://github.com/yahoo/Oak) [[1]](#paper1) 
instead of Java’s `ConcurrentSkipList` (CSL).
   
   These two changes significantly reduce the number of heap-objects and thus 
decrease dramatically the GC’s memory and performance overhead.
   
   This implementation was proposed before 
([#5698](https://github.com/apache/druid/issues/5698) and 
[#7676](https://github.com/apache/druid/pull/7676)).
   This issue expands on these with system-level experiments results, as well 
as more comprehensive component-level benchmarks results (as requested by the 
community).
   In addition to improved performance compared to older versions.
   
   > <b id="paper1">[1]</b>  Oak: a Scalable Off-Heap Allocated Key-Value Map. 
_ACM Conference on Principles and Practices of Parallel Programming (PPoPP) 
‘2020_.
   
   
   ### Rationale
   
   Our implementation (`OakIncrementalIndex`) instantiates a sub-linear number 
of objects with respect to the number of rows in the incremental-index, as 
opposed to a linear number of metadata objects that are instantiated by CSL.
   For typical Incremental-Index sizes (e.g., the current flush threshold is 1M 
rows), this overhead is millions of Java metadata objects just for internal CSL 
use.
   In addition, an on-heap multi-dimensional key might include many small 
objects that increase the memory overhead even further, as opposed to 
`OakIncrementalIndex` that needs only one buffer object for many 
multi-dimensional keys.
   Our experiments show that when using `OnHeapIncrementalIndex` and 
`OffHeapIncrementalIndex`, Java GC requires roughly 200% memory compared to the 
raw data size to achieve reasonable ingestion speed.
   Furthermore, this large number of objects also incur longer GC pauses (about 
40% of the runtime in our experiments) as there are many long-living objects to 
traverse.
   `OakIncrementalIndex` has only 2% memory overhead and negligible GC runtime 
overhead, yielding almost 33% of the memory usage and 60% of the runtime (1.7x 
ingestion throughput) compared to the on-heap and off-heap implementations.
   
   We evaluated `OakIncrementalIndex` with comparison to 
`OnHeapIncrementalIndex` and `OffHeapIncrementalIndex` via system-level 
experiments and component-level benchmarks. 
   The experimental setup and the results are depicted 
[here](https://github.com/liran-funaro/druid/wiki/Evaluation).
   
   ### Test plan
   
   We modified all the unit-test and benchmarks to test all the available 
incremental-index implementations (on-heap, off-heap, and Oak). All the unit 
tests passed successfully.
   
   ### Operational impact
   
   This change will not affect any existing clusters. It will work seamlessly 
and interchangeably with existing incremental index implementations.  See our 
wiki’s [usage](https://github.com/liran-funaro/druid/wiki/Usage) section for 
more details.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to