JingsongLi commented on code in PR #8109:
URL: https://github.com/apache/paimon/pull/8109#discussion_r3349044401


##########
paimon-core/src/main/java/org/apache/paimon/globalindex/btree/BTreeGlobalIndexBuilder.java:
##########
@@ -143,7 +143,7 @@ public Optional<Pair<RowRangeIndex, List<DataSplit>>> scan() {
         if (snapshot == null) {
             return Optional.empty();
         }
-        snapshotReader = snapshotReader.withSnapshot(snapshot);
+        snapshotReader = withReadType(snapshotReader.withSnapshot(snapshot));

Review Comment:
   This pruning changes the semantics for data-evolution tables when the 
indexed column was added after some data was already written. 
`DataEvolutionFileStoreScan.withReadType` filters out manifest entries whose 
physical file schema does not contain any requested non-system field. For a 
newly added indexed column, old files do not contain that field, but they 
should still be scanned and indexed with a `NULL` key (the reader can project 
the missing column as null, and the BTree writer/reader already support null 
keys). With this change, those old files are dropped during `scan()`, so `IS 
NULL` queries on the new column miss the old rows after the index is built. I 
verified this with a small regression test: write rows, add column `f3`, write 
one new row with `f3`, build a BTree index on `f3`, then run a global-index 
scan for `f3 IS NULL`. I expected the old rows, but the index returned 0 rows. 
Could we avoid applying this read-type pruning for normal data files that lack 
the indexed column, while still excluding blob/vector side files?
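
   A minimal, self-contained sketch of the difference, not Paimon code: 
   `FileEntry`, `pruneByReadType`, and `keepDataFilesMissingColumn` are 
   hypothetical names that only model the filtering rule described above. It 
   also assumes that a side file should be kept only when it physically 
   contains a requested field, which is my reading of "still excluding 
   blob/vector side files" rather than something confirmed by the code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReadTypePruningSketch {

    /** Hypothetical stand-in for a manifest entry; not a Paimon class. */
    static final class FileEntry {
        final String name;
        final boolean sideFile; // true for blob/vector side files
        final Set<String> physicalFields; // non-system fields stored in the file

        FileEntry(String name, boolean sideFile, String... physicalFields) {
            this.name = name;
            this.sideFile = sideFile;
            this.physicalFields = new HashSet<>(Arrays.asList(physicalFields));
        }

        boolean containsAny(Set<String> readFields) {
            for (String field : readFields) {
                if (physicalFields.contains(field)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Models the pruning in question: drop every file without a requested field. */
    static List<String> pruneByReadType(List<FileEntry> entries, Set<String> readFields) {
        List<String> kept = new ArrayList<>();
        for (FileEntry entry : entries) {
            if (entry.containsAny(readFields)) {
                kept.add(entry.name);
            }
        }
        return kept;
    }

    /**
     * Models the suggested behaviour (an assumption): normal data files are always
     * kept so the missing column can be indexed as NULL; side files are kept only
     * if they contain a requested field.
     */
    static List<String> keepDataFilesMissingColumn(
            List<FileEntry> entries, Set<String> readFields) {
        List<String> kept = new ArrayList<>();
        for (FileEntry entry : entries) {
            if (!entry.sideFile || entry.containsAny(readFields)) {
                kept.add(entry.name);
            }
        }
        return kept;
    }

    /** Old data file written before f3 existed, a blob side file, and a new data file. */
    static List<FileEntry> sampleEntries() {
        return Arrays.asList(
                new FileEntry("old-1", false, "f1", "f2"),
                new FileEntry("old-blob", true, "blob_col"),
                new FileEntry("new-1", false, "f1", "f2", "f3"));
    }

    public static void main(String[] args) {
        Set<String> readFields = new HashSet<>(Arrays.asList("f3"));
        System.out.println(pruneByReadType(sampleEntries(), readFields)); // [new-1]
        System.out.println(keepDataFilesMissingColumn(sampleEntries(), readFields)); // [old-1, new-1]
    }
}
```

   In this model the first rule loses `old-1`, which is exactly the file whose 
   rows should show up under `f3 IS NULL`; the second rule keeps it and still 
   drops `old-blob`.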



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
