Re: [PR] API, Core: Add FileIO to Scan API [iceberg]

via GitHub Thu, 12 Mar 2026 14:39:57 -0700


rdblue commented on code in PR #15561:
URL: https://github.com/apache/iceberg/pull/15561#discussion_r2927548872



##########
core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java:
##########
@@ -170,7 +166,7 @@ public CloseableIterable<FileScanTask> planFiles() {
           .withEndSnapshotId(endSnapshotId)
           .withUseSnapshotSchema(true);
     } else if (snapshotId != null) {
-      boolean useSnapShotSchema = snapshotId != 
table.currentSnapshot().snapshotId();
+      boolean useSnapShotSchema = snapshotId != 
table().currentSnapshot().snapshotId();

Review Comment:
   I don't think this is a blocker, but it doesn't look correct. Additional 
nit: "snapshot" is one word and should be capitalized "Snapshot".
   
   @singhpk234, The snapshot schema should be used when time traveling to a tag 
or a specific snapshot ID, but not when reading from a branch. That context 
comes from how the ref or snapshot was configured. Choosing a specific snapshot 
should generally send useSnapshotSchema=true, but just reading from a branch 
should not. Here is the description from the REST spec:
   
   > Whether to use the schema at the time the snapshot was written. When time 
travelling, the snapshot schema should be used (true). When scanning a branch, 
the table schema should be used (false).
   
   This comparison isn't going to be sufficient because the distinction is 
whether the snapshot was selected via branch name vs directly by ID or by tag 
name. This needs to know whether `useRef` was called and whether the ref was a 
branch.
   
   We also can't rely on comparing the result schema because `DataTableScan` 
(and its children) always override the schema to the branch/tag/snapshot schema 
and  when `useSnapshot` or `useRef` are called. That's because the fields 
passed to `select` are applied lazily in `planFiles`. The snapshot/ref 
selection changes the base schema and columns are selected by name, or a 
specific projection is used directly.
   
   To fix this, I think you need to override some of the API methods to detect 
how the scan is configured.
   
   And while we're looking at schema projection, I think the projection code is 
also incorrect:
   
   ```java
       List<String> selectedColumns =
           
schema().columns().stream().map(Types.NestedField::name).collect(Collectors.toList());
   ```
   
   The problem here is that it will select top-level fields only because 
`NestedField#name` returns the local field's name and not any children. To get 
the full field names, you'd need to call `TypeUtil.getProjectedIds` and then 
pass each ID through `schema().findColumnName(id)` like you do for stats 
fields. This is what `SnapshotScan` does:
   
   ```java
       List<Integer> projectedFieldIds = 
Lists.newArrayList(TypeUtil.getProjectedIds(schema()));
       List<String> projectedFieldNames =
           
projectedFieldIds.stream().map(schema()::findColumnName).collect(Collectors.toList());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] API, Core: Add FileIO to Scan API [iceberg]

Reply via email to