singhpk234 commented on code in PR #14615:
URL: https://github.com/apache/iceberg/pull/14615#discussion_r2561859605


##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java:
##########
@@ -150,6 +150,16 @@ public void testSelectRewrite() {
         .isEqualTo("(float IS NOT NULL AND is_nan(float))");
   }
 
+  @TestTemplate
+  public void selectWithLimit() {
+    Object[] first = row(1L, "a", 1.0F);
+    Object[] second = row(2L, "b", 2.0F);
+    Object[] third = row(3L, "c", Float.NaN);
+    assertThat(sql("SELECT * FROM %s LIMIT 1", tableName)).containsExactly(first);
+    assertThat(sql("SELECT * FROM %s LIMIT 2", tableName)).containsExactly(first, second);
+    assertThat(sql("SELECT * FROM %s LIMIT 3", tableName)).containsExactly(first, second, third);

Review Comment:
   My understanding is that this test would pass even without this change. How about inspecting the optimized plan and extracting the scan node, so we can verify that the limit actually percolates down to the scan? Something along these lines (a sketch; `limitedDf` is the Dataset produced by the LIMIT query, and the helper name is hypothetical):
   
   ```java
   // hypothetical helper (sketch): walk the optimized plan and pull out the limit
   // that was pushed down to the Iceberg batch scan, or null if none was pushed.
   // assumes imports for LogicalPlan, DataSourceV2ScanRelation, SparkBatchQueryScan,
   // JavaConverters, Optional, Stream
   private Integer extractPushedLimit(Dataset<Row> limitedDf) {
     LogicalPlan logicalPlan = limitedDf.queryExecution().optimizedPlan();
     Optional<Integer> limit =
         JavaConverters.asJavaCollection(logicalPlan.collectLeaves()).stream()
             .flatMap(
                 plan -> {
                   // only DSv2 scan relations can wrap an Iceberg scan
                   if (!(plan instanceof DataSourceV2ScanRelation)) {
                     return Stream.empty();
                   }

                   DataSourceV2ScanRelation scanRelation = (DataSourceV2ScanRelation) plan;
                   if (!(scanRelation.scan() instanceof SparkBatchQueryScan)) {
                     return Stream.empty();
                   }

                   SparkBatchQueryScan batchQueryScan = (SparkBatchQueryScan) scanRelation.scan();
                   return Stream.ofNullable(batchQueryScan.pushedLimit());
                 })
             .findFirst();

     return limit.orElse(null);
   }
   ```
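
   And a hypothetical usage in the test (assuming the base class exposes `spark`, as the other Spark SQL tests do):

   ```java
   Dataset<Row> limitedDf = spark.sql(String.format("SELECT * FROM %s LIMIT 2", tableName));
   assertThat(extractPushedLimit(limitedDf)).isEqualTo(2);
   ```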
   
   credits: https://github.com/apache/iceberg/pull/10943/files#diff-5de7aa01c6be719c3085c1f0416ed700fb78358103c8b26c6e3ecb9f063c04dbR270



##########
core/src/main/java/org/apache/iceberg/BaseScan.java:
##########
@@ -293,6 +293,11 @@ public ThisT metricsReporter(MetricsReporter reporter) {
     return newRefinedScan(table, schema, context.reportWith(reporter));
   }
 
+  @Override
+  public ThisT minRowsRequested(long numRows) {
+    return newRefinedScan(table, schema, context.minRowsRequested(numRows));
+  }

Review Comment:
   make sense, can we add a test for metadata table read as it would trigger a 
different scan obj, wdyt ?
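
   Something like this, perhaps (a rough sketch; the test name and the choice of the `files` metadata table are just for illustration):

   ```java
   @TestTemplate
   public void selectWithLimitFromMetadataTable() {
     // metadata tables go through a different Scan implementation than regular
     // data reads, so the limit should be honored there as well
     assertThat(sql("SELECT * FROM %s.files LIMIT 1", tableName)).hasSize(1);
   }
   ```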


