Copilot commented on code in PR #17718:
URL: https://github.com/apache/pinot/pull/17718#discussion_r2819974383
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########
Review Comment:
The _firstRawValuesByKey map is not cleared when seal() is called, which
retains references to potentially large objects even after they're no longer
needed. Following the pattern used in StringColumnPreIndexStatsCollector (line
145 sets _values = null after sealing), this map should be nulled out in the
seal() method to allow garbage collection.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########
@@ -279,9 +315,31 @@ private AbstractColumnStatisticsCollector
createKeyStatsCollector(String key, Ob
case OBJECT:
return new StringColumnPreIndexStatsCollector(key, config);
default:
- LOGGER.warn("Unknown data type {} for key {} and value type {}", type,
key, value.getClass().getName());
+ LOGGER.warn("Unknown data type {} for key {}", dataType, key);
return new StringColumnPreIndexStatsCollector(key, config);
+ }
+ }
+
+ private AbstractColumnStatisticsCollector
promoteNumericKeyStatsToStringCollector(String key, Object value) {
+ Object firstRawValue = _firstRawValuesByKey.get(key);
+ _firstRawValuesByKey.putIfAbsent(key, value);
+
+ AbstractColumnStatisticsCollector stringKeyStats =
createStringKeyStatsCollector(key);
+ _keyStats.put(key, stringKeyStats);
+ // Best-effort recovery: replay the first observed raw value as string to
keep a partial history.
+ if (firstRawValue != null) {
+ stringKeyStats.collect(String.valueOf(firstRawValue));
}
Review Comment:
This promotion logic only replays the first value and the current value,
losing all intermediate values collected by the numeric stats collector. For
example, if a key has values [123, 456, 789, "abc"], only 123 and "abc" will be
collected in the string stats, losing 456 and 789. This results in incorrect
statistics (wrong cardinality, min/max values).
To fix this, the promotion should replay all values previously collected by
the numeric stats collector, not just the first one. Consider extracting all
collected values from the old collector before replacing it.
```suggestion
AbstractColumnStatisticsCollector oldKeyStats = _keyStats.get(key);
AbstractColumnStatisticsCollector stringKeyStats =
createStringKeyStatsCollector(key);
_keyStats.put(key, stringKeyStats);
// Replay all previously collected values from the old numeric stats
collector, if available.
if (oldKeyStats != null) {
Object[] previousValues = oldKeyStats.getSortedValues();
if (previousValues != null) {
for (Object previousValue : previousValues) {
if (previousValue != null) {
stringKeyStats.collect(String.valueOf(previousValue));
}
}
}
} else if (firstRawValue != null) {
// Fallback: if there is no old collector, keep the previous
best-effort behavior and replay the first value.
stringKeyStats.collect(String.valueOf(firstRawValue));
}
```
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########
Review Comment:
The class-level documentation states that "Each key has a single type for
the value's associated with it across all documents" and that "heterogeneous
value types for a key...can be raised as a fault." However, this PR introduces
support for mixed types by promoting numeric collectors to string collectors.
The documentation should be updated to reflect that heterogeneous types are now
handled via type promotion to STRING rather than being treated as a fault.
##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollectorTest.java:
##########
@@ -351,6 +351,29 @@ public void testTypeConversionAndJsonSerialization() {
assertEquals(sObj.getMaxValue(), "{\"k1\":\"v1\",\"k2\":\"v2\"}");
}
+ @Test
+ public void testNumericKeyTypePromotedToStringForMixedValues() {
+ Map<String, Object> r1 = new HashMap<>();
+ r1.put("traceId", 9876543210L);
+
+ Map<String, Object> r2 = new HashMap<>();
+ r2.put("traceId", "c69b6613-e174-49f1-ac47-4e9ab98e513f");
+
+ StatsCollectorConfig cfg = newConfig(false);
+ MapColumnPreIndexStatsCollector col = new
MapColumnPreIndexStatsCollector("col", cfg);
+ col.collect(r1);
+ col.collect(r2);
+ col.seal();
+
+ AbstractColumnStatisticsCollector keyStats =
col.getKeyStatistics("traceId");
+ assertNotNull(keyStats);
+ assertTrue(keyStats instanceof StringColumnPreIndexStatsCollector);
+ assertEquals(keyStats.getCardinality(), 2);
+ assertEquals(keyStats.getMinValue(), "9876543210");
+ assertEquals(keyStats.getMaxValue(),
"c69b6613-e174-49f1-ac47-4e9ab98e513f");
+ assertEquals(keyStats.getTotalNumberOfEntries(), 2);
+ }
+
Review Comment:
This test only covers the case where the first value is numeric and the
second value is a string. It doesn't test the scenario where there are multiple
numeric values before a string value is encountered, which would expose a
critical bug in the promotion logic (intermediate values are lost). Consider
adding a test case like: [123L, 456L, 789L, "abc"] to verify all numeric values
are preserved after promotion.
```suggestion
@Test
public void testNumericKeyTypePromotionWithMultipleNumericValues() {
Map<String, Object> r1 = new HashMap<>();
r1.put("traceId", 123L);
Map<String, Object> r2 = new HashMap<>();
r2.put("traceId", 456L);
Map<String, Object> r3 = new HashMap<>();
r3.put("traceId", 789L);
Map<String, Object> r4 = new HashMap<>();
r4.put("traceId", "abc");
StatsCollectorConfig cfg = newConfig(false);
MapColumnPreIndexStatsCollector col = new
MapColumnPreIndexStatsCollector("col", cfg);
col.collect(r1);
col.collect(r2);
col.collect(r3);
col.collect(r4);
col.seal();
AbstractColumnStatisticsCollector keyStats =
col.getKeyStatistics("traceId");
assertNotNull(keyStats);
assertTrue(keyStats instanceof StringColumnPreIndexStatsCollector);
assertEquals(keyStats.getCardinality(), 4);
assertEquals(keyStats.getMinValue(), "123");
assertEquals(keyStats.getMaxValue(), "abc");
assertEquals(keyStats.getTotalNumberOfEntries(), 4);
}
```
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java:
##########
@@ -279,9 +315,31 @@ private AbstractColumnStatisticsCollector
createKeyStatsCollector(String key, Ob
case OBJECT:
return new StringColumnPreIndexStatsCollector(key, config);
default:
- LOGGER.warn("Unknown data type {} for key {} and value type {}", type,
key, value.getClass().getName());
+ LOGGER.warn("Unknown data type {} for key {}", dataType, key);
return new StringColumnPreIndexStatsCollector(key, config);
+ }
+ }
+
+ private AbstractColumnStatisticsCollector
promoteNumericKeyStatsToStringCollector(String key, Object value) {
+ Object firstRawValue = _firstRawValuesByKey.get(key);
+ _firstRawValuesByKey.putIfAbsent(key, value);
Review Comment:
This line is redundant because the key was already stored in
_firstRawValuesByKey when the key stats collector was first created (line 112).
Since putIfAbsent only puts the value if the key is absent, and the key is
always present when this method is called (because it was set on line 112),
this line has no effect and can be removed.
```suggestion
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]