Re: [PR] Spark: time-travel filter fails on renamed columns in BaseDistributedDataScan [iceberg]

via GitHub Wed, 03 Jun 2026 18:04:50 -0700


szehon-ho commented on code in PR #16523:
URL: https://github.com/apache/iceberg/pull/16523#discussion_r3352813348



##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##########
@@ -696,6 +696,50 @@ public void 
testRequiredNestedFieldInOptionalStructFilter() {
     sql("DROP TABLE IF EXISTS %s", nestedStructTable);
   }
 
+  @TestTemplate
+  public void testTimeTravelFilterOnRenamedColumn() {
+    // create a separate table for this test
+    String ttTableName = tableName("tt_rename_table");
+    sql("DROP TABLE IF EXISTS %s", ttTableName);
+    sql("CREATE TABLE %s (id BIGINT, col DOUBLE) USING iceberg", ttTableName);

Review Comment:
   This test doesn't actually exercise the changed code path. 
`BaseDistributedDataScan` (via `SparkDistributedDataScan`) is only used when 
distributed planning is enabled — see `SparkScanBuilder#newBatchScan`, which 
otherwise falls back to `table.newBatchScan()` (the local `DataTableScan` 
path). With the default `read.data-planning-mode=local`, this read never 
reaches `BaseDistributedDataScan`, so the test passes with or without the fix.
   
   To reproduce the original bug, please enable distributed planning, e.g. 
create the table with `TBLPROPERTIES ('read.data-planning-mode'='distributed', 
'read.delete-planning-mode'='distributed')`, and make sure 
`spark.driver.maxResultSize >= 256MB` (required by 
`SparkReadConf#distributedPlanningEnabled`). It'd be good to confirm the test 
fails on `main` without the `specCache` change.



##########
core/src/main/java/org/apache/iceberg/BaseDistributedDataScan.java:
##########
@@ -70,6 +70,11 @@ protected BaseDistributedDataScan(Table table, Schema 
schema, TableScanContext c
     this.localPlanningSizeThreshold = localParallelism * 
LOCAL_PLANNING_MAX_SLOT_SIZE;
   }
 
+  @Override
+  protected boolean useSnapshotSchema() {
+    return true;
+  }

Review Comment:
   This override is redundant and can be removed. `BaseDistributedDataScan 
extends DataScan`, and `DataScan` already overrides `useSnapshotSchema()` to 
return `true`, so this value is already inherited here. That means the 
`specCache()` change below is the entire fix — this override is a no-op.
   
   (Also, it isn't really "consistent with DataTableScan": `DataTableScan` 
needs the override because it extends `BaseTableScan` / `SnapshotScan` 
directly, whereas this class extends `DataScan`.)



##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##########
@@ -696,6 +696,50 @@ public void 
testRequiredNestedFieldInOptionalStructFilter() {
     sql("DROP TABLE IF EXISTS %s", nestedStructTable);
   }
 
+  @TestTemplate
+  public void testTimeTravelFilterOnRenamedColumn() {
+    // create a separate table for this test

Review Comment:
   nit: this narration comment doesn't add much, and the v4.0 copy omits it. 
Suggest dropping it so the per-module copies stay identical.



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##########
@@ -740,6 +740,45 @@ public void simpleTypesInFilter() {
     sql("DROP TABLE IF EXISTS %s", tableName);
   }
 
+  @TestTemplate
+  public void testTimeTravelFilterOnRenamedColumn() {

Review Comment:
   Two things on this test:
   1. It's missing from the `spark/v4.1` module (the latest Spark version, and 
the primary target for new tests). Please add it there too. This is the same 
point @huan233usc raised.
   2. This copy has drifted from the v3.5 copy (v3.5 has explanatory comments, 
this one doesn't). Please keep the per-module copies identical.
   
   Same coverage concern as v3.5: without enabling distributed planning, this 
test goes through the local scan path and never hits `BaseDistributedDataScan`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spark: time-travel filter fails on renamed columns in BaseDistributedDataScan [iceberg]

Reply via email to