deeppatel710 commented on code in PR #18144:
URL: https://github.com/apache/pinot/pull/18144#discussion_r3069168634
##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountHLLAggregationFunction.java:
##########
@@ -113,11 +136,21 @@ public void aggregate(int length, AggregationResultHolder
aggregationResultHolde
protected void aggregateSV(int length, AggregationResultHolder
aggregationResultHolder, BlockValSet blockValSet,
DataType storedType) {
- // For dictionary-encoded expression, store dictionary ids into the bitmap
+ // For dictionary-encoded expression, use adaptive strategy based on
dictionary size
Dictionary dictionary = blockValSet.getDictionary();
if (dictionary != null) {
int[] dictIds = blockValSet.getDictionaryIdsSV();
- getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0,
length);
+ if (dictionary.length() > _dictSizeThreshold) {
Review Comment:
Thanks for the review, @xiangfu0! Great catch on the type-safety concern.
You're right that mixing the bitmap and HLL paths in the same result holder
is unsafe. To be precise about when this can bite: within a single consuming
(realtime) segment, the dictionary grows as new data is ingested. If one
doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a
subsequent batch on the same scan sees the grown dictionary above the
threshold, getHyperLogLog() would cast the existing DictIdsWrapper to
HyperLogLog and throw a ClassCastException.
The fix centralizes the
defense in both getHyperLogLog() helpers — they now check instanceof
DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6
aggregation paths since they all funnel through these two methods. Added a
regression test that simulates the dictionary-growth scenario directly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]