liran-funaro commented on issue #7900:
URL: https://github.com/apache/druid/issues/7900#issuecomment-643750934

I realize that I'm a little late to the party since `CliIndexer` is already merged, but I want to raise a possible issue with this design.

Once many concurrent incremental indexes are processed on the same JVM heap, the number of long-lived objects on that heap will be larger than in any individual Peon. Unfortunately, the JVM does not handle workloads with a huge number of long-lived objects well. This evidently causes long pauses in each GC cycle, which can add up to as much as 50% of the process runtime.

However, the value of using the `CliIndexer`, IMO, is great. To solve this, I suggest storing all incremental index data (keys and values) off-heap, which will dramatically reduce the number of heap objects.

Please check out my issue (#9967) and PR (#10001), which address exactly this problem. This solution improves the CPU and RAM utilization of batch ingestion by over 50% in both serial and parallel ingestion modes, and might greatly improve the resource utilization and performance of ingestion using the `CliIndexer`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
