Copilot commented on code in PR #15281:
URL: https://github.com/apache/iceberg/pull/15281#discussion_r2785353967
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -257,15 +261,8 @@ private long totalRecords(Snapshot snapshot) {
}
@Override
- public String description() {
- String groupingKeyFieldNamesAsString =
- groupingKeyType().fields().stream()
- .map(Types.NestedField::name)
- .collect(Collectors.joining(", "));
-
- return String.format(
- "%s (branch=%s) [filters=%s, groupedBy=%s]",
- table(), branch(), Spark3Util.describe(filterExpressions), groupingKeyFieldNamesAsString);
+ public String toString() {
+ return description();
}
@Override
Review Comment:
This change makes `toString()` delegate to `description()`, but
`Scan.description()` is commonly implemented as a default method that delegates
to `toString()` in Spark (per the Scan Javadoc pattern). If `SparkScan` (or any
subclass) relies on the default `description()`, this can create infinite
recursion (`description() -> toString() -> description() ...`) and a stack
overflow. Fix this by ensuring `SparkScan` provides a concrete `description()`
implementation (and having `toString()` call that), or by removing the
`toString()` override here and letting `description()` remain the single source
of truth.
```suggestion
```
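For illustration, a self-contained mini-model of the hazard and the proposed fix. These toy classes only mirror the delegation pattern described above; they are not the actual Spark `Scan` interface or the Iceberg classes:
```java
// Toy interface mirroring the described pattern: description() defaults to toString().
interface ToyScan {
  default String description() {
    return toString();
  }
}

// Hazard: relying on the default description() while toString() delegates back to it.
class RecursiveToyScan implements ToyScan {
  @Override
  public String toString() {
    return description(); // description() -> toString() -> description() ... StackOverflowError
  }
}

// Fix: a concrete description() terminates the chain, so toString() may safely delegate.
class SafeToyScan implements ToyScan {
  @Override
  public String description() {
    return "SafeToyScan [filters=...]"; // does not call toString()
  }

  @Override
  public String toString() {
    return description();
  }
}
```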
##########
spark/v4.1/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java:
##########
@@ -3014,7 +3014,7 @@ private void checkJoinAndFilterConditions(String query, String join, String iceb
assertThat(planAsString)
.as("Pushed filters must match")
- .contains("[filters=" + icebergFilters + ",");
+ .contains("filters=" + icebergFilters + ",");
Review Comment:
Same issue as in the `TestFilterPushDown` comment below: `runtimeFilters=` includes
`filters=` as a substring, so this check can accidentally match the wrong section of
the plan string. Tighten the assertion to include an unambiguous delimiter
(e.g., `", filters="`) or some context around the expected field.
```suggestion
.contains(", filters=" + icebergFilters + ",");
```
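To illustrate the false positive, a small stand-alone check. The plan string below is fabricated, not a real Spark plan; AssertJ is assumed on the classpath, as it already is in the test:
```java
import static org.assertj.core.api.Assertions.assertThat;

public class FiltersSubstringDemo {
  public static void main(String[] args) {
    // Fabricated plan string in which only runtimeFilters= is present.
    String planAsString = "IcebergScan(... runtimeFilters=id = 1, ...)";

    // The loose check matches inside "runtimeFilters=" and passes even though
    // there is no standalone filters= section.
    assertThat(planAsString).contains("filters=" + "id = 1" + ",");

    // The tightened check with a leading delimiter does not match the
    // runtime-filter section, so it would correctly fail on this string.
    assertThat(planAsString).doesNotContain(", filters=" + "id = 1" + ",");
  }
}
```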
##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestFilterPushDown.java:
##########
@@ -676,7 +676,7 @@ private void checkFilters(
assertThat(planAsString)
.as("Pushed filters must match")
- .contains("[filters=" + icebergFilters + ",");
+ .contains("filters=" + icebergFilters + ",");
Review Comment:
This assertion can produce false positives because `runtimeFilters=`
contains the substring `filters=`. If `icebergFilters` happens to match the
runtime filter string, the test could pass even if the main `filters=` section
is missing/incorrect. Make the match more specific (e.g., include a delimiter
like `", filters="` or match the full scan prefix such as `"IcebergScan("`
+ `"filters=..."`) so it cannot accidentally match `runtimeFilters=`.
```suggestion
.contains("filters=" + icebergFilters + ", runtimeFilters=");
```
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -153,6 +153,10 @@ protected List<Expression> filterExpressions() {
return filterExpressions;
}
+ protected String filtersDesc() {
+ return Spark3Util.describe(filterExpressions);
Review Comment:
Using a human-readable description string as a proxy for filter identity is
risky when it is later used by `equals()`/`hashCode()` in subclasses (as
introduced in this PR). Different filter expression trees can potentially
serialize to the same description string (lossy formatting, reordered terms,
elided parentheses), which can make `equals()` return true for non-equal scans
and break hash-based collections/caches. Prefer comparing the actual expression
objects (or a canonical, non-lossy representation) in `equals()`/`hashCode()`
rather than `Spark3Util.describe(...)` output.
```suggestion
return filterExpressions.stream()
.map(Expression::toString)
.collect(Collectors.joining(" AND "));
```
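As a hedged sketch of the alternative the comment recommends (comparing the expression objects themselves rather than their rendered description): the stand-in class and field names below are illustrative, and the sketch assumes the expression type provides value-based `equals()`/`hashCode()`; if it does not, a canonical non-lossy representation would be needed instead:
```java
import java.util.List;
import java.util.Objects;

// Illustrative stand-in for a scan that carries pushed-down filter expressions.
final class ScanIdentitySketch {
  private final String table;
  private final List<Object> filterExpressions; // the expression objects, not describe() output

  ScanIdentitySketch(String table, List<Object> filterExpressions) {
    this.table = table;
    this.filterExpressions = filterExpressions;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ScanIdentitySketch)) {
      return false;
    }
    ScanIdentitySketch that = (ScanIdentitySketch) other;
    // Compare the expressions themselves; two trees that merely render to the
    // same description string are not conflated.
    return table.equals(that.table) && filterExpressions.equals(that.filterExpressions);
  }

  @Override
  public int hashCode() {
    return Objects.hash(table, filterExpressions);
  }
}
```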
##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java:
##########
@@ -1022,6 +1023,50 @@ public void testPartitionedOr() throws Exception {
assertThat(scan.planInputPartitions()).hasSize(4);
}
+ @TestTemplate
+ public void testBatchQueryScanDescription() throws Exception {
+ createPartitionedTable(spark, tableName, "data");
+ SparkScanBuilder builder = scanBuilder();
+
+ withSQLConf(
+ ImmutableMap.of(SparkSQLProperties.PRESERVE_DATA_GROUPING, "true"),
+ () -> {
+ Predicate predicate1 = new Predicate("=",
expressions(fieldRef("id"), intLit(1)));
+ Predicate predicate2 = new Predicate(">",
expressions(fieldRef("id"), intLit(0)));
+ pushFilters(builder, predicate1, predicate2);
+
+ Scan scan = builder.build();
+ String description = scan.description();
+
+ assertThat(description).contains("IcebergScan");
+ assertThat(description).contains(tableName);
+ assertThat(description).contains("filters=id = 1, id > 0");
+ assertThat(description).contains("groupedBy=data");
+ });
+ }
+
+ @TestTemplate
+ public void testCopyOnWriteScanDescription() throws Exception {
+ createPartitionedTable(spark, tableName, "data");
+ SparkScanBuilder builder = scanBuilder();
+
+ withSQLConf(
+ ImmutableMap.of(SparkSQLProperties.PRESERVE_DATA_GROUPING, "true"),
+ () -> {
+ Predicate predicate1 = new Predicate("=",
expressions(fieldRef("id"), intLit(2)));
+ Predicate predicate2 = new Predicate("<",
expressions(fieldRef("id"), intLit(10)));
+ pushFilters(builder, predicate1, predicate2);
+
+ Scan scan = builder.buildCopyOnWriteScan();
+ String description = scan.description();
+
+ assertThat(description).contains("IcebergCopyOnWriteScan");
+ assertThat(description).contains(tableName);
+ assertThat(description).contains("filters=id = 2, id < 10");
Review Comment:
These assertions bake in the exact formatting and ordering of the rendered
filter description (e.g., `"id = 1, id > 0"`). If `Spark3Util.describe(...)`
changes formatting (commas vs AND, spacing, predicate order), the tests will
fail even though behavior is correct. Consider asserting each predicate
independently (e.g., contains `"id = 1"` and contains `"id > 0"`) and
separately asserting the `filters=` label, rather than asserting the full
combined string in a single contains check.
```suggestion
assertThat(description).contains("filters=");
assertThat(description).contains("id = 2");
assertThat(description).contains("id < 10");
```
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java:
##########
@@ -250,4 +250,10 @@ private StructLikeSet collectGroupingKeys(Iterable<ScanTaskGroup<T>> taskGroupIt
return keys;
}
+
+ protected String groupingKeyDesc() {
+ return groupingKeyType().fields().stream()
+ .map(org.apache.iceberg.types.Types.NestedField::name)
+ .collect(Collectors.joining(", "));
+ }
Review Comment:
Using a fully-qualified class reference inside the stream
(`org.apache.iceberg.types.Types.NestedField::name`) reduces readability and is
inconsistent with nearby code that already refers to `Types`. Prefer
adding/using an import and referencing `Types.NestedField::name` instead.
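A sketch of the helper with the import in place, assuming `org.apache.iceberg.types.Types` can be imported in this file without a name clash:
```java
import org.apache.iceberg.types.Types;

// ...

protected String groupingKeyDesc() {
  return groupingKeyType().fields().stream()
      .map(Types.NestedField::name)
      .collect(Collectors.joining(", "));
}
```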
##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java:
##########
@@ -1022,6 +1023,50 @@ public void testPartitionedOr() throws Exception {
assertThat(scan.planInputPartitions()).hasSize(4);
}
+ @TestTemplate
+ public void testBatchQueryScanDescription() throws Exception {
+ createPartitionedTable(spark, tableName, "data");
+ SparkScanBuilder builder = scanBuilder();
+
+ withSQLConf(
+ ImmutableMap.of(SparkSQLProperties.PRESERVE_DATA_GROUPING, "true"),
+ () -> {
+ Predicate predicate1 = new Predicate("=",
expressions(fieldRef("id"), intLit(1)));
+ Predicate predicate2 = new Predicate(">",
expressions(fieldRef("id"), intLit(0)));
+ pushFilters(builder, predicate1, predicate2);
+
+ Scan scan = builder.build();
+ String description = scan.description();
+
+ assertThat(description).contains("IcebergScan");
+ assertThat(description).contains(tableName);
+ assertThat(description).contains("filters=id = 1, id > 0");
+ assertThat(description).contains("groupedBy=data");
Review Comment:
These assertions bake in the exact formatting and ordering of the rendered
filter description (e.g., `"id = 1, id > 0"`). If `Spark3Util.describe(...)`
changes formatting (commas vs AND, spacing, predicate order), the tests will
fail even though behavior is correct. Consider asserting each predicate
independently (e.g., contains `"id = 1"` and contains `"id > 0"`) and
separately asserting the `filters=` label, rather than asserting the full
combined string in a single contains check.
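A sketch of what the loosened assertions could look like for this test, mirroring the suggestion on the copy-on-write test above (values taken from the diff under review):
```java
assertThat(description).contains("filters=");
assertThat(description).contains("id = 1");
assertThat(description).contains("id > 0");
```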
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]