praveenc7 commented on issue #17336:
URL: https://github.com/apache/pinot/issues/17336#issuecomment-3649950048

   Thanks @Jackie-Jiang, I'll take that into account when testing this.
   
   > For high cardinality, when the same dictionary id repeats a lot, directly 
inserting into HLL might produce worse performance
   Agreed.
   
   In the segment I tested (~25M rows, ~14M cardinality), HLL outperformed for 
the query I ran, which produced 11 million distinct values. That said, I agree 
the picture flips in the low-cardinality case. If we had something like 25M 
rows and ~10K distinct values, then:
   - BitSet should likely win on throughput thanks to O(1) set operations and 
predictable access patterns, while HLL may end up paying a similar per-row cost 
without gaining much.
   
   What I’m planning next is to introduce a switch threshold (similar in spirit 
to “smartHLL,” but adapted to our use case): start with an exact structure 
(e.g., BitSet/Roaring), and only promote to HLL once the observed distinct 
count / density indicates it’s worth it. I’ll test a range of cardinalities and 
repetition patterns to see if there’s a “sweet spot” for that cutoff.
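
To make the idea concrete, here is a rough sketch of the promote-on-threshold pattern described above. It is written in Python for brevity and is not Pinot code: `TinyHLL`, `AdaptiveDistinctCounter`, and `promote_threshold` are hypothetical names, a plain `set` stands in for BitSet/Roaring, and the sketch hashes dictionary ids directly, whereas a real implementation would presumably hash the dictionary values so results can be merged across segments.

```python
import hashlib
import math


class TinyHLL:
    """Minimal HyperLogLog, for illustration only (not Pinot's implementation)."""

    def __init__(self, p=12):
        self.p = p                       # number of index bits
        self.m = 1 << p                  # number of registers
        self.registers = bytearray(self.m)

    def add(self, value):
        digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")            # 64-bit hash
        idx = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the first 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


class AdaptiveDistinctCounter:
    """Starts exact and promotes to HLL once the distinct count passes a cutoff."""

    def __init__(self, promote_threshold=10_000):
        self.promote_threshold = promote_threshold   # assumed cutoff, to be tuned
        self.exact = set()                           # stand-in for BitSet/Roaring
        self.hll = None

    def add(self, dict_id):
        if self.hll is not None:
            self.hll.add(dict_id)
            return
        self.exact.add(dict_id)
        if len(self.exact) > self.promote_threshold:
            self._promote()

    def _promote(self):
        # Replay the exact ids into the sketch, then drop the exact structure.
        self.hll = TinyHLL()
        for dict_id in self.exact:
            self.hll.add(dict_id)
        self.exact = None

    def count(self):
        if self.hll is None:
            return len(self.exact)                   # exact answer
        return round(self.hll.estimate())            # approximate answer
```

With a counter like this, a low-cardinality column never leaves the exact path, while a high-cardinality one pays the exact-structure cost only up to the cutoff. Where that cutoff should sit is exactly what the planned benchmarks would need to determine.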


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

