praveenc7 opened a new issue, #17336: URL: https://github.com/apache/pinot/issues/17336
## Problem

The DISTINCTCOUNTHLL aggregation function suffers from severe [performance degradation](https://github.com/apache/pinot/blob/f46f631ce179c9cb152b9846f580f01f4ffa33ae/pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountHLLAggregationFunction.java#L107) when processing high-cardinality dictionary-encoded columns (e.g., ~14 million distinct values). Profiling shows that 50% of CPU time is spent in RoaringBitmap operations:

<img width="1584" height="684" alt="Image" src="https://github.com/user-attachments/assets/50cdb0ba-8cd5-412c-ae98-038fd5497ee9" />

For dictionary-encoded columns, the current implementation uses a RoaringBitmap to track dictionary IDs during aggregation. While memory-efficient for low cardinality, this approach has O(n log n) insertion complexity that becomes prohibitively expensive for high-cardinality columns (>100K distinct values).

- Queries on high-cardinality columns (1M-15M distinct values, e.g., user IDs, member IDs) take about 6-10 seconds.
- RoaringBitmap operations dominate query execution time.
- As a result, there is no performance benefit from using HLL over an exact distinct count.

## Proposed Solution

Implement adaptive cardinality handling that dynamically switches from RoaringBitmap to HyperLogLog (a rough sketch follows below):

1. Low cardinality: use RoaringBitmap (memory efficient, exact counts).
2. High cardinality: convert to HyperLogLog (O(1) insertions).

Tested with POC code that chooses HyperLogLog for the high-cardinality column and observed improvements from ~8 seconds to ~700 ms.
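For illustration, here is a minimal Java sketch of the adaptive idea, not Pinot's actual implementation: the class name `AdaptiveDistinctCounter`, the conversion threshold, and the `log2m` value are placeholders, and it assumes the RoaringBitmap and Clearspring `HyperLogLog` libraries that Pinot already depends on.

```java
import org.roaringbitmap.RoaringBitmap;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

/**
 * Illustrative sketch only: track dictionary IDs exactly in a RoaringBitmap and,
 * once a threshold is crossed, fold them into a HyperLogLog and switch to
 * approximate counting for the remaining values.
 */
public class AdaptiveDistinctCounter {
  // Placeholder values; a real threshold/log2m would need tuning or a query option.
  private static final int CONVERSION_THRESHOLD = 100_000;
  private static final int HLL_LOG2M = 12;

  private RoaringBitmap _bitmap = new RoaringBitmap();
  private HyperLogLog _hll;
  private int _exactCount;

  public void add(int dictId) {
    if (_hll != null) {
      // Already in approximate mode: O(1) insertion, no bitmap maintenance.
      _hll.offer(dictId);
      return;
    }
    if (_bitmap.checkedAdd(dictId)) {
      _exactCount++;
      if (_exactCount > CONVERSION_THRESHOLD) {
        convertToHll();
      }
    }
  }

  private void convertToHll() {
    _hll = new HyperLogLog(HLL_LOG2M);
    // Replay the exact set into the HLL, then drop the bitmap.
    // In a real implementation the dictionary *value* (not the ID) should be offered
    // so that estimates merge correctly across segments.
    _bitmap.forEach((int dictId) -> _hll.offer(dictId));
    _bitmap = null;
  }

  public long getCardinality() {
    return _hll != null ? _hll.cardinality() : _exactCount;
  }
}
```

The threshold check runs only when `checkedAdd` reports a genuinely new dictionary ID, so the exact path stays cheap for low-cardinality columns and the one-time conversion cost is paid at most once per aggregation.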
