JingsongLi commented on code in PR #8109:
URL: https://github.com/apache/paimon/pull/8109#discussion_r3349044401
##########
paimon-core/src/main/java/org/apache/paimon/globalindex/btree/BTreeGlobalIndexBuilder.java:
##########
@@ -143,7 +143,7 @@ public Optional<Pair<RowRangeIndex, List<DataSplit>>>
scan() {
if (snapshot == null) {
return Optional.empty();
}
- snapshotReader = snapshotReader.withSnapshot(snapshot);
+ snapshotReader = withReadType(snapshotReader.withSnapshot(snapshot));
Review Comment:
This pruning changes the semantics for data-evolution tables when the
indexed column was added after some data was already written.
`DataEvolutionFileStoreScan.withReadType` filters out manifest entries whose
physical file schema does not contain any requested non-system field. For a
newly added indexed column, old files do not contain that field, but they
should still be scanned and indexed with a `NULL` key (the reader can project
the missing column as null, and the BTree writer/reader already support null
keys). With this change those old files are dropped during `scan()`, so `IS
NULL` queries on the new column miss the old rows after the index is built. I
verified this with a small regression: write rows, add column `f3`, write one
new row with `f3`, build a BTree index on `f3`, then global-index scan `f3 IS
NULL`; I expected the old rows, but the index returned 0 rows. Could we avoid applying
this read-type pruning for normal data files that lack the indexed column,
while still excluding blob/vector side files?
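
To make the difference concrete, here is a minimal toy model of the two pruning behaviours. It does not use any Paimon classes: `FileEntry`, `pruneByReadType`, `pruneSideFilesOnly`, and the `sideFile` flag are hypothetical names invented for illustration, and the real `DataEvolutionFileStoreScan` logic is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReadTypePruningSketch {

    /** Hypothetical stand-in for a manifest entry; not a Paimon class. */
    public static final class FileEntry {
        final String name;
        final boolean sideFile; // assumed marker for blob/vector side files
        final Set<String> physicalFields;

        FileEntry(String name, boolean sideFile, String... physicalFields) {
            this.name = name;
            this.sideFile = sideFile;
            this.physicalFields = new HashSet<>(Arrays.asList(physicalFields));
        }
    }

    /** Models the pruning under review: keep only files that physically contain the field. */
    public static List<String> pruneByReadType(List<FileEntry> files, String indexedField) {
        List<String> kept = new ArrayList<>();
        for (FileEntry f : files) {
            if (f.physicalFields.contains(indexedField)) {
                kept.add(f.name);
            }
        }
        return kept;
    }

    /** Models the suggested alternative: drop side files, keep every normal data file. */
    public static List<String> pruneSideFilesOnly(List<FileEntry> files) {
        List<String> kept = new ArrayList<>();
        for (FileEntry f : files) {
            if (!f.sideFile) {
                kept.add(f.name); // a missing indexed column would be read as NULL
            }
        }
        return kept;
    }

    /** Files mirroring the regression: data written before and after adding f3, plus a side file. */
    public static List<FileEntry> sampleFiles() {
        return Arrays.asList(
                new FileEntry("old-data", false, "f1", "f2"),
                new FileEntry("new-data", false, "f1", "f2", "f3"),
                new FileEntry("blob-side", true, "blob_col"));
    }

    public static void main(String[] args) {
        List<FileEntry> files = sampleFiles();
        System.out.println("read-type pruning keeps: " + pruneByReadType(files, "f3"));
        System.out.println("side-file-only pruning keeps: " + pruneSideFilesOnly(files));
    }
}
```

In this model the read-type pruning drops `old-data`, which is the file whose rows should be indexed under a `NULL` key, while the side-file-only variant keeps it and still excludes `blob-side`.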