deeppatel710 commented on code in PR #18144:
URL: https://github.com/apache/pinot/pull/18144#discussion_r3069168634


##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctCountHLLAggregationFunction.java:
##########
@@ -113,11 +136,21 @@ public void aggregate(int length, AggregationResultHolder 
aggregationResultHolde
 
   protected void aggregateSV(int length, AggregationResultHolder 
aggregationResultHolder, BlockValSet blockValSet,
       DataType storedType) {
-    // For dictionary-encoded expression, store dictionary ids into the bitmap
+    // For dictionary-encoded expression, use adaptive strategy based on 
dictionary size
     Dictionary dictionary = blockValSet.getDictionary();
     if (dictionary != null) {
       int[] dictIds = blockValSet.getDictionaryIdsSV();
-      getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0, 
length);
+      if (dictionary.length() > _dictSizeThreshold) {

Review Comment:
   Thanks for the review, @xiangfu0!  Great catch on the type-safety concern.   
                                                                                
                                                   
                                                                                
                                                                                
                                                      
   You're right that mixing the bitmap and HLL paths in the same result holder 
is unsafe. To be precise about when this can bite: within a single consuming 
(realtime) segment, the dictionary grows as new data is ingested. If one 
doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a 
subsequent batch on the same scan sees the grown dictionary above the 
threshold, getHyperLogLog() would cast the existing DictIdsWrapper to 
HyperLogLog and throw a ClassCastException.                                     
                                                                                
                  
                                                                                
                                                                                
                                                      The fix centralizes the 
defense in both getHyperLogLog() helpers — they now check instanceof 
DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6 
aggregation paths since they all funnel through these two methods. Added a 
regression test that simulates the dictionary-growth scenario directly. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to