LuciferYang opened a new pull request, #56264:
URL: https://github.com/apache/spark/pull/56264

   ### What changes were proposed in this pull request?
   
   This adds a `SupportsScanMerging` mix-in for DSv2 `Scan` and teaches
   `MergeSubplans` (through `PlanMerger`) to use it. When two scans of the same
   table differ only in their projected columns and/or pushed filters, the
   optimizer asks the source to fuse them into one scan that covers both, and
   then deduplicates the subplans built on top.
   
   The interface is implemented once on the `FileScan` base, so every built-in
   file format (Parquet, ORC, CSV, JSON, Text, Avro) participates. Two scans merge
   when they read the same data (same file index, schema, options and partition
   filters); the merged scan reads the union of the two read schemas. When the
   pushed data filters differ, the merged scan widens them to `OR(f1, f2)` and
   reads a superset -- the exact per-side predicate is still enforced by the
   post-scan `Filter`, which `MergeSubplans` turns into per-aggregate
   `FILTER (WHERE ...)` clauses. That widening is off by default, gated by
   `spark.sql.files.scanMerge.ignorePushedDataFilters`. Merging is declined when an
   aggregate is pushed into the scan, since there is then no post-scan `Filter` to
   separate the two sides.
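
The merge rule described above can be sketched as a small, self-contained Python model. This is an illustration only, not the `SupportsScanMerging` API or the `FileScan` implementation: `ToyScan`, `merge_scans` and the string encoding of filters are hypothetical names invented for the sketch, and the handling of a side with no pushed filter (the merged scan pushes nothing) is an assumption.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ToyScan:
    """Hypothetical stand-in for a file scan (not Spark's FileScan)."""
    file_index: str
    table_schema: Tuple[str, ...]
    options: Tuple[Tuple[str, str], ...]
    partition_filters: FrozenSet[str]
    read_schema: Tuple[str, ...]      # projected columns, in order
    data_filters: FrozenSet[str]      # pushed data filters, implicitly ANDed
    pushed_aggregate: bool = False


def _conjunction(filters: FrozenSet[str]) -> str:
    # Render a set of ANDed filters deterministically.
    return " AND ".join(sorted(filters))


def merge_scans(a: ToyScan, b: ToyScan,
                ignore_pushed_data_filters: bool = False) -> Optional[ToyScan]:
    """Return a scan covering both inputs, or None when merging is declined."""
    # With a pushed aggregate there is no post-scan Filter to separate the sides.
    if a.pushed_aggregate or b.pushed_aggregate:
        return None
    # Both scans must read the same data.
    same_data = (
        a.file_index == b.file_index
        and a.table_schema == b.table_schema
        and a.options == b.options
        and a.partition_filters == b.partition_filters
    )
    if not same_data:
        return None
    if a.data_filters == b.data_filters:
        merged_filters = a.data_filters
    elif not ignore_pushed_data_filters:
        return None  # widening is off by default
    elif not a.data_filters or not b.data_filters:
        # Assumption: an unfiltered side forces the merged scan to push nothing.
        merged_filters = frozenset()
    else:
        widened = f"({_conjunction(a.data_filters)}) OR ({_conjunction(b.data_filters)})"
        merged_filters = frozenset({widened})
    # The merged scan reads the union of the two read schemas.
    extra = tuple(c for c in b.read_schema if c not in a.read_schema)
    return replace(a, read_schema=a.read_schema + extra, data_filters=merged_filters)
```

In this model, two scans with different pushed filters are merged only when the flag is passed as `True`, mirroring the default-off gating described above.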
   
   ### Why are the changes needed?
   
   V1 file sources already get this. Their filter and column pushdown happens
   during physical planning, so when `MergeSubplans` runs the logical plan still
   has plain `Filter`/`Project` nodes over identical relation leaves, which the
   existing rules merge. DSv2 bakes pushdown into `DataSourceV2ScanRelation` during
   logical optimization, so two subqueries that differ only in `WHERE` or `SELECT`
   become structurally different leaves and cannot be merged. For example TPC-DS q9
   (fifteen scalar subqueries over `store_sales`) collapses to a single scan under
   V1 but reads the table fifteen times under V2. This closes that gap.
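
To make the cost difference concrete, here is a toy Python model of the q9 shape: several filtered counts over one table, evaluated once as separate scans and once as a single widened scan with per-aggregate filters. Everything here is hypothetical (`CountingTable`, the three buckets, the synthetic rows); it only illustrates why a superset scan plus per-side predicates gives the same answers while reading the table once.

```python
class CountingTable:
    """Hypothetical in-memory table that counts how often it is scanned."""

    def __init__(self, rows):
        self.rows = rows
        self.scans = 0

    def scan(self, predicate=lambda row: True):
        self.scans += 1
        return [row for row in self.rows if predicate(row)]


# Illustrative quantity buckets; the real query uses its own predicates.
BUCKETS = [(1, 20), (21, 40), (41, 60)]


def in_bucket(lo, hi):
    return lambda row: lo <= row["ss_quantity"] <= hi


def unmerged_counts(table):
    # One scan per subquery, each with its own pushed filter.
    return [len(table.scan(in_bucket(lo, hi))) for lo, hi in BUCKETS]


def merged_counts(table):
    preds = [in_bucket(lo, hi) for lo, hi in BUCKETS]
    # A single scan with the widened filter OR(f1, f2, ...) reads a superset.
    superset = table.scan(lambda row: any(p(row) for p in preds))
    # Each aggregate re-applies its exact predicate, like FILTER (WHERE ...).
    return [sum(1 for row in superset if p(row)) for p in preds]


rows = [{"ss_quantity": q} for q in range(1, 101)]
separate, fused = CountingTable(rows), CountingTable(rows)
assert unmerged_counts(separate) == merged_counts(fused) == [20, 20, 20]
print(separate.scans, fused.scans)  # prints: 3 1
```

The results match because every row a subquery needs satisfies the widened filter, and the per-aggregate predicate discards the rows that belong to the other sides.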
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The connector interface and the config are internal, and the config
   defaults to off, so the default plan is unchanged.
   
   ### How was this patch tested?
   
   New tests in `PlanMergeSuite` run each query through both the V1 and V2
   file-source paths, with AQE on and off, and assert identical results and
   identical merge structure. The merge-semantics cases cover differing columns,
   differing data filters, differing partition filters, filter propagation through
   a join, and a multi-subquery composition, plus negative cases; the
   format-coverage cases run a representative query over Parquet, ORC and JSON.
   `MergeSubplansSuite`, `DataSourceV2Suite` and the Parquet/ORC V2
   aggregate-pushdown suites still pass. I also checked that V1 and V2 produce
   identical results and merge structure on the affected TPC-DS queries (q2, q9,
   q28, q59, q77a, q88, q90) at scale factor 1.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.8
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

