szehon-ho commented on code in PR #16523:
URL: https://github.com/apache/iceberg/pull/16523#discussion_r3352813348
##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##########
@@ -696,6 +696,50 @@ public void
testRequiredNestedFieldInOptionalStructFilter() {
sql("DROP TABLE IF EXISTS %s", nestedStructTable);
}
+ @TestTemplate
+ public void testTimeTravelFilterOnRenamedColumn() {
+ // create a separate table for this test
+ String ttTableName = tableName("tt_rename_table");
+ sql("DROP TABLE IF EXISTS %s", ttTableName);
+ sql("CREATE TABLE %s (id BIGINT, col DOUBLE) USING iceberg", ttTableName);
Review Comment:
This test doesn't actually exercise the changed code path.
`BaseDistributedDataScan` (via `SparkDistributedDataScan`) is only used when
distributed planning is enabled — see `SparkScanBuilder#newBatchScan`, which
otherwise falls back to `table.newBatchScan()` (the local `DataTableScan`
path). With the default `read.data-planning-mode=local`, this read never
reaches `BaseDistributedDataScan`, so the test passes with or without the fix.
To reproduce the original bug, please enable distributed planning, e.g.
create the table with `TBLPROPERTIES ('read.data-planning-mode'='distributed',
'read.delete-planning-mode'='distributed')`, and make sure
`spark.driver.maxResultSize >= 256MB` (required by
`SparkReadConf#distributedPlanningEnabled`). It'd be good to confirm the test
fails on `main` without the `specCache` change.
##########
core/src/main/java/org/apache/iceberg/BaseDistributedDataScan.java:
##########
@@ -70,6 +70,11 @@ protected BaseDistributedDataScan(Table table, Schema
schema, TableScanContext c
this.localPlanningSizeThreshold = localParallelism *
LOCAL_PLANNING_MAX_SLOT_SIZE;
}
+ @Override
+ protected boolean useSnapshotSchema() {
+ return true;
+ }
Review Comment:
This override is redundant and can be removed. `BaseDistributedDataScan
extends DataScan`, and `DataScan` already overrides `useSnapshotSchema()` to
return `true`, so this value is already inherited here. That means the
`specCache()` change below is the entire fix — this override is a no-op.
(Also, it isn't really "consistent with DataTableScan": `DataTableScan`
needs the override because it extends `BaseTableScan` / `SnapshotScan`
directly, whereas this class extends `DataScan`.)
##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##########
@@ -696,6 +696,50 @@ public void
testRequiredNestedFieldInOptionalStructFilter() {
sql("DROP TABLE IF EXISTS %s", nestedStructTable);
}
+ @TestTemplate
+ public void testTimeTravelFilterOnRenamedColumn() {
+ // create a separate table for this test
Review Comment:
nit: this narration comment doesn't add much, and the v4.0 copy omits it.
Suggest dropping it so the per-module copies stay identical.
##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##########
@@ -740,6 +740,45 @@ public void simpleTypesInFilter() {
sql("DROP TABLE IF EXISTS %s", tableName);
}
+ @TestTemplate
+ public void testTimeTravelFilterOnRenamedColumn() {
Review Comment:
Two things on this test:
1. It's missing from the `spark/v4.1` module (the latest Spark version, and
the primary target for new tests). Please add it there too. This is the same
point @huan233usc raised.
2. This copy has drifted from the v3.5 copy (v3.5 has explanatory comments,
this one doesn't). Please keep the per-module copies identical.
Same coverage concern as v3.5: without enabling distributed planning, this
test goes through the local scan path and never hits `BaseDistributedDataScan`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]