rdblue commented on code in PR #15561:
URL: https://github.com/apache/iceberg/pull/15561#discussion_r2927548872
##########
core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java:
##########
@@ -170,7 +166,7 @@ public CloseableIterable<FileScanTask> planFiles() {
.withEndSnapshotId(endSnapshotId)
.withUseSnapshotSchema(true);
} else if (snapshotId != null) {
- boolean useSnapShotSchema = snapshotId !=
table.currentSnapshot().snapshotId();
+ boolean useSnapShotSchema = snapshotId !=
table().currentSnapshot().snapshotId();
Review Comment:
I don't think this is a blocker, but it doesn't look correct. Additional
nit: "snapshot" is one word and should be capitalized "Snapshot".
@singhpk234, The snapshot schema should be used when time traveling to a tag
or a specific snapshot ID, but not when reading from a branch. That context
comes from how the ref or snapshot was configured. Choosing a specific snapshot
should generally send useSnapshotSchema=true, but just reading from a branch
should not. Here is the description from the REST spec:
> Whether to use the schema at the time the snapshot was written. When time
travelling, the snapshot schema should be used (true). When scanning a branch,
the table schema should be used (false).
This comparison isn't going to be sufficient because the distinction is
whether the snapshot was selected via branch name vs directly by ID or by tag
name. This needs to know whether `useRef` was called and whether the ref was a
branch.
We also can't rely on comparing the result schema because `DataTableScan`
(and its children) always override the schema to the branch/tag/snapshot schema
and when `useSnapshot` or `useRef` are called. That's because the fields
passed to `select` are applied lazily in `planFiles`. The snapshot/ref
selection changes the base schema and columns are selected by name, or a
specific projection is used directly.
To fix this, I think you need to override some of the API methods to detect
how the scan is configured.
And while we're looking at schema projection, I think the projection code is
also incorrect:
```java
List<String> selectedColumns =
schema().columns().stream().map(Types.NestedField::name).collect(Collectors.toList());
```
The problem here is that it will select top-level fields only because
`NestedField#name` returns the local field's name and not any children. To get
the full field names, you'd need to call `TypeUtil.getProjectedIds` and then
pass each ID through `schema().findColumnName(id)` like you do for stats
fields. This is what `SnapshotScan` does:
```java
List<Integer> projectedFieldIds =
Lists.newArrayList(TypeUtil.getProjectedIds(schema()));
List<String> projectedFieldNames =
projectedFieldIds.stream().map(schema()::findColumnName).collect(Collectors.toList());
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]