LuciferYang opened a new pull request, #56264: URL: https://github.com/apache/spark/pull/56264
### What changes were proposed in this pull request? This adds a `SupportsScanMerging` mix-in for DSv2 `Scan` and teaches `MergeSubplans` (through `PlanMerger`) to use it. When two scans of the same table differ only in their projected columns and/or pushed filters, the optimizer asks the source to fuse them into one scan that covers both, and then deduplicates the subplans built on top. The interface is implemented once on the `FileScan` base, so every built-in file format (Parquet, ORC, CSV, JSON, Text, Avro) participates. Two scans merge when they read the same data (same file index, schema, options and partition filters); the merged scan reads the union of the two read schemas. When the pushed data filters differ, the merged scan widens them to `OR(f1, f2)` and reads a superset -- the exact per-side predicate is still enforced by the post-scan `Filter`, which `MergeSubplans` turns into per-aggregate `FILTER (WHERE ...)` clauses. That widening is off by default, gated by `spark.sql.files.scanMerge.ignorePushedDataFilters`. Merging is declined when an aggregate is pushed into the scan, since there is then no post-scan `Filter` to separate the two sides. ### Why are the changes needed? V1 file sources already get this. Their filter and column pushdown happens during physical planning, so when `MergeSubplans` runs the logical plan still has plain `Filter`/`Project` nodes over identical relation leaves, which the existing rules merge. DSv2 bakes pushdown into `DataSourceV2ScanRelation` during logical optimization, so two subqueries that differ only in `WHERE` or `SELECT` become structurally different leaves and cannot be merged. For example TPC-DS q9 (fifteen scalar subqueries over `store_sales`) collapses to a single scan under V1 but reads the table fifteen times under V2. This closes that gap. ### Does this PR introduce _any_ user-facing change? No. The connector interface and the config are internal, and the config defaults to off, so the default plan is unchanged. ### How was this patch tested? New tests in `PlanMergeSuite` run each query through both the V1 and V2 file-source paths, with AQE on and off, and assert identical results and identical merge structure. The merge-semantics cases cover differing columns, differing data filters, differing partition filters, filter propagation through a join, and a multi-subquery composition, plus negative cases; the format-coverage cases run a representative query over Parquet, ORC and JSON. `MergeSubplansSuite`, `DataSourceV2Suite` and the Parquet/ORC V2 aggregate-pushdown suites still pass. I also checked that V1 and V2 produce identical results and merge structure on the affected TPC-DS queries (q2, q9, q28, q59, q77a, q88, q90) at scale factor 1. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.8 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
