praveenc7 opened a new issue, #17336: URL: https://github.com/apache/pinot/issues/17336
## Problem

The DISTINCTCOUNTHLL aggregation function suffers from severe [performance degradation](https://github.com/apache/pinot/blob/f46f631ce179c9cb152b9846f580f01f4ffa33ae/pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountHLLAggregationFunction.java#L107) when processing high-cardinality dictionary-encoded columns (e.g., ~14 million distinct values). Profiling shows that 50% of CPU time is spent in RoaringBitmap operations:

<img width="1584" height="684" alt="Image" src="https://github.com/user-attachments/assets/50cdb0ba-8cd5-412c-ae98-038fd5497ee9" />

For dictionary-encoded columns, the current implementation uses a RoaringBitmap to track dictionary IDs during aggregation. While memory-efficient for low cardinality, this approach has O(n log n) insertion complexity that becomes prohibitively expensive for high-cardinality columns (>100K distinct values).

- Queries on high-cardinality columns (1M-15M distinct values, e.g., user IDs, member IDs) take about 6-10 seconds.
- RoaringBitmap operations dominate query execution time.
- As a result, there is no performance benefit from using HLL over an exact distinct count.

## Proposed Solution

Implement adaptive cardinality handling that dynamically switches from RoaringBitmap to HyperLogLog (a rough sketch follows below):

1. Low cardinality: use RoaringBitmap (memory efficient, exact counts).
2. High cardinality: convert to HyperLogLog (O(1) insertions).

Tested with POC code that chooses HyperLogLog for the high-cardinality column and observed improvements from ~8 seconds to ~700 ms.
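For illustration, here is a minimal Java sketch of the adaptive idea, not Pinot's actual implementation: the class name `AdaptiveDistinctCounter`, the conversion threshold, and the `log2m` value are placeholders, and it assumes the RoaringBitmap and Clearspring `HyperLogLog` libraries that Pinot already depends on.

```java
import org.roaringbitmap.RoaringBitmap;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

/**
 * Illustrative sketch only: track dictionary IDs exactly in a RoaringBitmap and,
 * once a threshold is crossed, fold them into a HyperLogLog and switch to
 * approximate counting for the remaining values.
 */
public class AdaptiveDistinctCounter {
  // Placeholder values; a real threshold/log2m would need tuning or a query option.
  private static final int CONVERSION_THRESHOLD = 100_000;
  private static final int HLL_LOG2M = 12;

  private RoaringBitmap _bitmap = new RoaringBitmap();
  private HyperLogLog _hll;
  private int _exactCount;

  public void add(int dictId) {
    if (_hll != null) {
      // Already in approximate mode: O(1) insertion, no bitmap maintenance.
      _hll.offer(dictId);
      return;
    }
    if (_bitmap.checkedAdd(dictId)) {
      _exactCount++;
      if (_exactCount > CONVERSION_THRESHOLD) {
        convertToHll();
      }
    }
  }

  private void convertToHll() {
    _hll = new HyperLogLog(HLL_LOG2M);
    // Replay the exact set into the HLL, then drop the bitmap.
    // In a real implementation the dictionary *value* (not the ID) should be offered
    // so that estimates merge correctly across segments.
    _bitmap.forEach((int dictId) -> _hll.offer(dictId));
    _bitmap = null;
  }

  public long getCardinality() {
    return _hll != null ? _hll.cardinality() : _exactCount;
  }
}
```

The threshold check runs only when `checkedAdd` reports a genuinely new dictionary ID, so the exact path stays cheap for low-cardinality columns and the one-time conversion cost is paid at most once per aggregation.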
