praveenc7 commented on issue #17336: URL: https://github.com/apache/pinot/issues/17336#issuecomment-3649950048
Thanks @Jackie-Jiang, let me consider that when testing this.

> For high cardinality, when the same dictionary id repeats a lot, directly inserting into HLL might produce worse performance

Agreed. In the segment I tested (~25M rows, ~14M cardinality), HLL outperformed for the query I tested, which produced 11 million distinct values.

That said, I agree the picture flips in the low-cardinality case. If we had something like 25M rows and ~10K distinct values, then:

- BitSet should likely win on throughput thanks to O(1) set operations and predictable access patterns, while HLL may end up paying a similar per-row cost without gaining much.

What I'm planning next is to introduce a switch threshold (similar in spirit to "smartHLL", but adapted to our use case): start with an exact structure (e.g., BitSet/Roaring), and only promote to HLL once the observed distinct count / density indicates it's worth it. I'll test a range of cardinalities and repetition patterns to see if there's a "sweet spot" for that cutoff.
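To make the switch-threshold idea concrete, here is a rough, self-contained sketch of how such a counter might look. Everything in it is an assumption for illustration: `AdaptiveDistinctCounter`, `promoteThreshold`, and the tiny inline HyperLogLog are hypothetical and are not Pinot classes or the actual implementation under discussion. It also hashes the dictionary id directly, whereas real aggregation code would most likely hash the dictionary value so that sketches can be merged across segments.

```java
import java.util.BitSet;

// Hypothetical sketch: exact BitSet first, promote to a small HLL past a cutoff.
class AdaptiveDistinctCounter {
    private static final int P = 12;       // assumed precision: 2^12 registers
    private static final int M = 1 << P;

    private final int promoteThreshold;    // hypothetical cutoff on distinct ids
    private final BitSet seen;             // exact phase: one bit per dictionary id
    private int exactCount = 0;
    private byte[] registers;              // null until promoted to HLL

    AdaptiveDistinctCounter(int dictionarySize, int promoteThreshold) {
        this.seen = new BitSet(dictionarySize);
        this.promoteThreshold = promoteThreshold;
    }

    void add(int dictId) {
        if (registers != null) {
            addToHll(dictId);
            return;
        }
        if (!seen.get(dictId)) {
            seen.set(dictId);
            if (++exactCount > promoteThreshold) {
                promote();
            }
        }
    }

    boolean isPromoted() {
        return registers != null;
    }

    // Replay each distinct id into the HLL once, then release the exact state.
    private void promote() {
        registers = new byte[M];
        for (int id = seen.nextSetBit(0); id >= 0; id = seen.nextSetBit(id + 1)) {
            addToHll(id);
        }
        seen.clear();
    }

    private void addToHll(int dictId) {
        long h = mix(dictId);
        int idx = (int) (h >>> (64 - P));  // top P bits select the register
        long rest = h << P;                // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // SplitMix64-style mixer, used here as a stand-in hash function.
    private static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    long estimate() {
        if (registers == null) {
            return exactCount;             // still exact
        }
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double raw = alpha * M * M / sum;
        if (raw <= 2.5 * M && zeros > 0) {
            raw = M * Math.log((double) M / zeros);  // small-range correction
        }
        return Math.round(raw);
    }
}
```

With a low-cardinality input the counter never leaves the BitSet phase and stays exact; once the number of distinct ids passes the cutoff, each id seen so far is replayed into the HLL a single time, so heavy repetition before promotion costs only a bit test per row. Where the cutoff should sit is exactly what the planned benchmarks would need to determine.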
