Copilot commented on code in PR #17718:
URL: https://github.com/apache/pinot/pull/17718#discussion_r2819974383


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########


Review Comment:
   The _firstRawValuesByKey map is not cleared when seal() is called, which 
retains references to potentially large objects even after they're no longer 
needed. Following the pattern used in StringColumnPreIndexStatsCollector (line 
145 sets _values = null after sealing), this map should be nulled out in the 
seal() method to allow garbage collection.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########
@@ -279,9 +315,31 @@ private AbstractColumnStatisticsCollector 
createKeyStatsCollector(String key, Ob
       case OBJECT:
         return new StringColumnPreIndexStatsCollector(key, config);
       default:
-        LOGGER.warn("Unknown data type {} for key {} and value type {}", type, 
key, value.getClass().getName());
+        LOGGER.warn("Unknown data type {} for key {}", dataType, key);
         return new StringColumnPreIndexStatsCollector(key, config);
+      }
+  }
+
+  private AbstractColumnStatisticsCollector 
promoteNumericKeyStatsToStringCollector(String key, Object value) {
+    Object firstRawValue = _firstRawValuesByKey.get(key);
+    _firstRawValuesByKey.putIfAbsent(key, value);
+
+    AbstractColumnStatisticsCollector stringKeyStats = 
createStringKeyStatsCollector(key);
+    _keyStats.put(key, stringKeyStats);
+    // Best-effort recovery: replay the first observed raw value as string to 
keep a partial history.
+    if (firstRawValue != null) {
+      stringKeyStats.collect(String.valueOf(firstRawValue));
     }

Review Comment:
   This promotion logic only replays the first value and the current value, 
losing all intermediate values collected by the numeric stats collector. For 
example, if a key has values [123, 456, 789, "abc"], only 123 and "abc" will be 
collected in the string stats, losing 456 and 789. This results in incorrect 
statistics (wrong cardinality, min/max values).
   
   To fix this, the promotion should replay all values previously collected by 
the numeric stats collector, not just the first one. Consider extracting all 
collected values from the old collector before replacing it.
   ```suggestion
       AbstractColumnStatisticsCollector oldKeyStats = _keyStats.get(key);
   
       AbstractColumnStatisticsCollector stringKeyStats = 
createStringKeyStatsCollector(key);
       _keyStats.put(key, stringKeyStats);
   
       // Replay all previously collected values from the old numeric stats 
collector, if available.
       if (oldKeyStats != null) {
         Object[] previousValues = oldKeyStats.getSortedValues();
         if (previousValues != null) {
           for (Object previousValue : previousValues) {
             if (previousValue != null) {
               stringKeyStats.collect(String.valueOf(previousValue));
             }
           }
         }
       } else if (firstRawValue != null) {
         // Fallback: if there is no old collector, keep the previous 
best-effort behavior and replay the first value.
         stringKeyStats.collect(String.valueOf(firstRawValue));
       }
   ```



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########


Review Comment:
   The class-level documentation states that "Each key has a single type for 
the value's associated with it across all documents" and that "heterogeneous 
value types for a key...can be raised as a fault." However, this PR introduces 
support for mixed types by promoting numeric collectors to string collectors. 
The documentation should be updated to reflect that heterogeneous types are now 
handled via type promotion to STRING rather than being treated as a fault.



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollectorTest.java:
##########
@@ -351,6 +351,29 @@ public void testTypeConversionAndJsonSerialization() {
     assertEquals(sObj.getMaxValue(), "{\"k1\":\"v1\",\"k2\":\"v2\"}");
   }
 
+  @Test
+  public void testNumericKeyTypePromotedToStringForMixedValues() {
+    Map<String, Object> r1 = new HashMap<>();
+    r1.put("traceId", 9876543210L);
+
+    Map<String, Object> r2 = new HashMap<>();
+    r2.put("traceId", "c69b6613-e174-49f1-ac47-4e9ab98e513f");
+
+    StatsCollectorConfig cfg = newConfig(false);
+    MapColumnPreIndexStatsCollector col = new 
MapColumnPreIndexStatsCollector("col", cfg);
+    col.collect(r1);
+    col.collect(r2);
+    col.seal();
+
+    AbstractColumnStatisticsCollector keyStats = 
col.getKeyStatistics("traceId");
+    assertNotNull(keyStats);
+    assertTrue(keyStats instanceof StringColumnPreIndexStatsCollector);
+    assertEquals(keyStats.getCardinality(), 2);
+    assertEquals(keyStats.getMinValue(), "9876543210");
+    assertEquals(keyStats.getMaxValue(), 
"c69b6613-e174-49f1-ac47-4e9ab98e513f");
+    assertEquals(keyStats.getTotalNumberOfEntries(), 2);
+  }
+

Review Comment:
   This test only covers the case where the first value is numeric and the 
second value is a string. It doesn't test the scenario where there are multiple 
numeric values before a string value is encountered, which would expose a 
critical bug in the promotion logic (intermediate values are lost). Consider 
adding a test case like: [123L, 456L, 789L, "abc"] to verify all numeric values 
are preserved after promotion.
   ```suggestion
   
     @Test
     public void testNumericKeyTypePromotionWithMultipleNumericValues() {
       Map<String, Object> r1 = new HashMap<>();
       r1.put("traceId", 123L);
   
       Map<String, Object> r2 = new HashMap<>();
       r2.put("traceId", 456L);
   
       Map<String, Object> r3 = new HashMap<>();
       r3.put("traceId", 789L);
   
       Map<String, Object> r4 = new HashMap<>();
       r4.put("traceId", "abc");
   
       StatsCollectorConfig cfg = newConfig(false);
       MapColumnPreIndexStatsCollector col = new 
MapColumnPreIndexStatsCollector("col", cfg);
       col.collect(r1);
       col.collect(r2);
       col.collect(r3);
       col.collect(r4);
       col.seal();
   
       AbstractColumnStatisticsCollector keyStats = 
col.getKeyStatistics("traceId");
       assertNotNull(keyStats);
       assertTrue(keyStats instanceof StringColumnPreIndexStatsCollector);
       assertEquals(keyStats.getCardinality(), 4);
       assertEquals(keyStats.getMinValue(), "123");
       assertEquals(keyStats.getMaxValue(), "abc");
       assertEquals(keyStats.getTotalNumberOfEntries(), 4);
     }
   ```



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########
@@ -279,9 +315,31 @@ private AbstractColumnStatisticsCollector 
createKeyStatsCollector(String key, Ob
       case OBJECT:
         return new StringColumnPreIndexStatsCollector(key, config);
       default:
-        LOGGER.warn("Unknown data type {} for key {} and value type {}", type, 
key, value.getClass().getName());
+        LOGGER.warn("Unknown data type {} for key {}", dataType, key);
         return new StringColumnPreIndexStatsCollector(key, config);
+      }
+  }
+
+  private AbstractColumnStatisticsCollector 
promoteNumericKeyStatsToStringCollector(String key, Object value) {
+    Object firstRawValue = _firstRawValuesByKey.get(key);
+    _firstRawValuesByKey.putIfAbsent(key, value);

Review Comment:
   This line is redundant because the key was already stored in 
_firstRawValuesByKey when the key stats collector was first created (line 112). 
Since putIfAbsent only puts the value if the key is absent, and the key is 
always present when this method is called (because it was set on line 112), 
this line has no effect and can be removed.
   ```suggestion
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to