Re: [PR] Predicate Pushdown in Spark Structured Streaming (DataSource V2). [spark]

via GitHub Tue, 05 May 2026 08:48:41 -0700


thesquelched commented on PR #55679:
URL: https://github.com/apache/spark/pull/55679#issuecomment-4380857588


   Unfortunately, this doesn't work. I tested by creating a table with two 
partitions, then running four reads:
   
   1. Batch read, without partition filter
   2. Batch read, with partition filter
   3. Stream read, without partition filter
   4. Stream read, with partition filter
   
   Here's the code and output (lightly edited to remove logspam):
   
   ```
   $ spark-shell \
       --jars ./spark/v4.1/spark-runtime/build/libs/iceberg-spark-runtime-4.1_2.13-1.11.0-SNAPSHOT.jar \
       --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
       --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
       --conf spark.sql.catalog.spark_catalog.type=hadoop \
       --conf spark.sql.catalog.spark_catalog.warehouse=$PWD/testdata
   
   WARNING: Using incubator modules: jdk.incubator.vector
   26/05/05 10:36:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 4.1.1
         /_/
   
   Using Scala version 2.13.17 (OpenJDK 64-Bit Server VM, Java 17.0.18)
   Type in expressions to have them evaluated.
   Type :help for more information.
   Spark context Web UI available at http://192.168.1.45:4040
   Spark context available as 'sc' (master = local[*], app id = local-1777995408212).
   Spark session available as 'spark'.
   
   scala> import spark.implicits._
   import spark.implicits._
   
   scala> import org.apache.spark.sql.streaming._
   import org.apache.spark.sql.streaming._
   
   scala> spark.range(100L).select('id, 'id % 2 as 'bucket, concat(lit("value"), 'id) as 'value).repartition(1, 'bucket).writeTo("mytable").using("iceberg").partitionedBy('bucket).tableProperty("write.distribution-mode", "none").create()
   
   scala> spark.table("mytable").filter('id.isin(0, 1)).show
   +---+------+------+
   | id|bucket| value|
   +---+------+------+
   |  0|     0|value0|
   |  1|     1|value1|
   +---+------+------+
   
   scala> spark.table("mytable").filter('bucket === 0).filter('id.isin(0, 1)).show
   +---+------+------+
   | id|bucket| value|
   +---+------+------+
   |  0|     0|value0|
   +---+------+------+
   
   scala> spark.readStream.format("iceberg").table("mytable").writeStream.format("memory").trigger(Trigger.AvailableNow()).queryName("foo").start.awaitTermination
   26/05/05 10:37:54 ERROR SparkScan: [jalpan] creating micro batch stream
   26/05/05 10:37:54 ERROR SparkMicroBatchStream: [jalpan] creating micro batch with filter []
   26/05/05 10:37:55 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:37:55 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:37:55 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:37:55 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:37:55 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:37:55 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   
   scala> spark.readStream.format("iceberg").table("mytable").filter('bucket === 0).writeStream.format("memory").trigger(Trigger.AvailableNow()).queryName("foo").start.awaitTermination
   26/05/05 10:38:11 ERROR SparkScan: [jalpan] creating micro batch stream
   26/05/05 10:38:11 ERROR SparkMicroBatchStream: [jalpan] creating micro batch with filter []
   26/05/05 10:38:11 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:38:11 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:38:11 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:38:11 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:38:11 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   26/05/05 10:38:11 ERROR SparkMicroBatchStream: [jalpan] planning input partitions micro batch with filter []
   ```
   
   Here are the UI metrics for each read:
   
   1. Batch read, without partition filter (2 files read) ✅ 
   
   <img width="472" height="1138" alt="Screenshot 2026-05-05 at 10 37 43 AM" src="https://github.com/user-attachments/assets/763cc355-b0f8-430c-88ba-f90811ce77d9" />
   <br/>
   
   2. Batch read, with partition filter (1 file read) ✅ 
   
   <img width="474" height="1148" alt="Screenshot 2026-05-05 at 10 37 50 AM" src="https://github.com/user-attachments/assets/5c111777-10ad-4e6b-9f2f-cc63886e3eab" />
   <br/>
   
   3. Stream read, without partition filter (2 files read) ✅ 
   
   <img width="479" height="993" alt="Screenshot 2026-05-05 at 10 38 31 AM" src="https://github.com/user-attachments/assets/d56dfea4-d202-43bc-bfda-68c915d08ad0" />
   <br/>
   
   4. Stream read, with partition filter (2 files read) ❌ 
   
   <img width="473" height="1128" alt="Screenshot 2026-05-05 at 10 38 45 AM" src="https://github.com/user-attachments/assets/b61f23b5-12f3-4ecc-bfd8-5fa4f3284222" />
   <br/>
   
   The streaming read scans 2 files both with and without the partition filter, 
and the debug log shows an empty filter list (`with filter []`) in both runs. 
Together these indicate that the predicate pushdown did not happen.
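
   As a cross-check that doesn't depend on UI screenshots, something like the 
following could be pasted into the same spark-shell session. It's only a 
sketch: the helper `sourceRowsRead` and the query names are made up for 
illustration, and it assumes the per-source `numInputRows` in the streaming 
progress reflects the rows the Iceberg scan actually emitted. If the partition 
filter were pushed down, the filtered count should presumably come out lower 
than the unfiltered one.

   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.functions.col
   import org.apache.spark.sql.streaming.Trigger

   // Hypothetical helper: runs the stream to completion with AvailableNow and
   // sums the rows each source reported reading across the recorded micro-batches.
   def sourceRowsRead(df: DataFrame, name: String): Long = {
     val query = df.writeStream
       .format("memory")
       .queryName(name) // illustrative names, distinct per run
       .trigger(Trigger.AvailableNow())
       .start()
     query.awaitTermination()
     // recentProgress keeps one StreamingQueryProgress per recent micro-batch
     query.recentProgress.flatMap(_.sources).map(_.numInputRows).sum
   }

   val stream = spark.readStream.format("iceberg").table("mytable")

   val unfiltered = sourceRowsRead(stream, "unfiltered_check")
   val filtered = sourceRowsRead(stream.filter(col("bucket") === 0), "filtered_check")

   // Assumption: equal counts would suggest the filter was applied after the
   // scan rather than pushed into it.
   println(s"source rows read: unfiltered=$unfiltered, filtered=$filtered")
   ```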


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

