avaskys opened a new issue, #15692:
URL: https://github.com/apache/iceberg/issues/15692
### Feature Request / Improvement
Spark batch reads from Iceberg tables push down their filter expressions,
enabling manifest-level pruning (via partition range summaries), file-level
pruning (via column min/max statistics), and partition elimination. Spark
structured streaming reads do not currently benefit from any of this, but it
would be valuable to support filter pushdown in the `MicroBatchStream` path as
well.
Today, a streaming query like
`.readStream.format("iceberg").load("t").filter("partition_col = 'foo'")` plans
Spark tasks for, and reads, data files across all partitions; Spark applies the
filter only as a post-read record filter. For streaming reads with partition
filters, this can cause significant unnecessary I/O, task overhead, and compute
cost.
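To make the pruning concrete, here is a small self-contained sketch (stand-in types, not Iceberg's actual classes) of the file-level metadata checks a pushed-down filter enables: a hypothetical `FileMeta` record models the partition value and per-column min/max bounds that Iceberg tracks per data file, and `prune` keeps only files that could match a filter like `partition_col = 'foo' AND id >= N`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the metadata Iceberg tracks per data file:
// the file's partition value plus min/max bounds for an "id" column.
record FileMeta(String partitionValue, long minId, long maxId) {}

class PruningSketch {
  // File-level pruning: keep only files whose metadata could satisfy
  // `partition_col = partitionEq AND id >= idLowerBound`.
  static List<FileMeta> prune(List<FileMeta> files, String partitionEq, long idLowerBound) {
    return files.stream()
        .filter(f -> f.partitionValue().equals(partitionEq)) // partition elimination
        .filter(f -> f.maxId() >= idLowerBound)              // column min/max pruning
        .collect(Collectors.toList());
  }
}
```

Without pushdown, every file in the list becomes a read task; with it, only the files that survive `prune` do.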
The core API already supports this. `IncrementalAppendScan` inherits
`filter(Expression)` from the `Scan` interface, and `BaseIncrementalAppendScan`
correctly threads it to `ManifestGroup.filterData()` for the full pruning
pipeline. The gap is in the Spark connector: `SparkScan.toMicroBatchStream()`
does not pass filter expressions to `SparkMicroBatchStream`, so they are never
applied.
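A minimal sketch of where the missing wiring belongs, using stub interfaces rather than the real `org.apache.iceberg` types (only the shape is meant to be accurate): the stream would hold the pushed filter and apply it to the incremental scan when planning each micro-batch.

```java
// Stub stand-ins for the real Iceberg types; only the shape matters here.
interface Expression {}                      // stands in for org.apache.iceberg.expressions.Expression

interface IncrementalAppendScan {            // core API scan; filter(...) already works there
  IncrementalAppendScan filter(Expression expr);
}

// Hypothetical sketch of the fix: SparkScan.toMicroBatchStream() would hand its
// pushed filter to the stream, which applies it when planning each micro-batch.
class MicroBatchStreamSketch {
  private final Expression pushedFilter;     // today this is never passed in

  MicroBatchStreamSketch(Expression pushedFilter) {
    this.pushedFilter = pushedFilter;
  }

  IncrementalAppendScan planScan(IncrementalAppendScan scan) {
    // The essential call: give ManifestGroup.filterData() a chance to prune.
    return pushedFilter == null ? scan : scan.filter(pushedFilter);
  }
}
```

The actual change would touch `SparkScan.toMicroBatchStream()` and the `SparkMicroBatchStream` constructor, but the core of it is this single `filter(...)` call on the scan.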
Closing this gap would bring streaming reads to parity with batch reads for
filter pushdown, benefiting both partition-based and column statistics-based
pruning.
Affects all maintained Spark connector versions: v3.4, v3.5, v4.0, v4.1.
### Query engine
Spark
### Willingness to contribute
- [x] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]