praveenc7 commented on PR #17411: URL: https://github.com/apache/pinot/pull/17411#issuecomment-3717501228
> overall lgtm! > > One question, how was the default of 100_000 chosen as the threshold? Is that a desirable default for all queries going forward? Are there any scenarios where using the Integer.MAX_VALUE may be preferred? It'll be good to call this out in documentation as a guide for users > > Also, can you take care of updating OSS documentation to reflect these changes? Thanks! Thanks for the review @somandal On the 100,000 threshold? I chose 100K based on a combination of (a) micro-benchmarking across multiple recordCount sizes and cardinality ratios, and (b) validating behavior on the perf cluster with representative query patterns. The crossover point varies a bit by record count/cardinality ratio, but ~100K is a consistent “safe default” where the direct HLL path stops regressing and starts outperforming the RoaringBitmap-heavy path for workloads. (I included representative benchmark tables in the Testing section) When would Integer.MAX_VALUE be preferred? With the adaptive threshold in place, I don’t see a strong need to use Integer.MAX_VALUE in normal scenarios. The intent of the change is precisely to handle both low- and high-cardinality cases automatically. Added it as a guardrail if something odd shows up in production -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
