liran-funaro opened a new issue #9967: URL: https://github.com/apache/druid/issues/9967
### Motivation The current incremental-index implementations (on-heap and off-heap) suffer from poor memory utilization and sub-optimal performance. In some ingestion scenarios, we observed 200% memory overhead and 70% runtime overhead that both are attributed to the GC mechanism. This is mainly due to the large number of metadata objects created by Java’s `ConcurrentSkipList` (CSL). ### Proposed changes We implemented an alternative incremental-index (`OakIncrementalIndex`) that has two main attributes that are different from the current implementations: 1. It stores both **keys** and **values** off-heap (as opposed to the off-heap implementation that stores only the **values** off-heap). 2. It is based on [`OakMap`](https://github.com/yahoo/Oak) [[1]](#paper1) instead of Java’s `ConcurrentSkipList` (CSL). These two changes significantly reduce the number of heap-objects and thus decrease dramatically the GC’s memory and performance overhead. This implementation was proposed before ([#5698](https://github.com/apache/druid/issues/5698) and [#7676](https://github.com/apache/druid/pull/7676)). This issue expands on these with system-level experiments results, as well as more comprehensive component-level benchmarks results (as requested by the community). In addition to improved performance compared to older versions. > <b id="paper1">[1]</b> Oak: a Scalable Off-Heap Allocated Key-Value Map. _ACM Conference on Principles and Practices of Parallel Programming (PPoPP) ‘2020_. ### Rationale Our implementation (`OakIncrementalIndex`) instantiates a sub-linear number of objects with respect to the number of rows in the incremental-index, as opposed to a linear number of metadata objects that are instantiated by CSL. For typical Incremental-Index sizes (e.g., the current flush threshold is 1M rows), this overhead is millions of Java metadata objects just for internal CSL use. In addition, an on-heap multi-dimensional key might include many small objects that increase the memory overhead even further, as opposed to `OakIncrementalIndex` that needs only one buffer object for many multi-dimensional keys. Our experiments show that when using `OnHeapIncrementalIndex` and `OffHeapIncrementalIndex`, Java GC requires roughly 200% memory compared to the raw data size to achieve reasonable ingestion speed. Furthermore, this large number of objects also incur longer GC pauses (about 40% of the runtime in our experiments) as there are many long-living objects to traverse. `OakIncrementalIndex` has only 2% memory overhead and negligible GC runtime overhead, yielding almost 33% of the memory usage and 60% of the runtime (1.7x ingestion throughput) compared to the on-heap and off-heap implementations. We evaluated `OakIncrementalIndex` with comparison to `OnHeapIncrementalIndex` and `OffHeapIncrementalIndex` via system-level experiments and component-level benchmarks. The experimental setup and the results are depicted [here](https://github.com/liran-funaro/druid/wiki/Evaluation). ### Test plan We modified all the unit-test and benchmarks to test all the available incremental-index implementations (on-heap, off-heap, and Oak). All the unit tests passed successfully. ### Operational impact This change will not affect any existing clusters. It will work seamlessly and interchangeably with existing incremental index implementations. See our wiki’s [usage](https://github.com/liran-funaro/druid/wiki/Usage) section for more details. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
