Re: [PR] [SPARK-57205][SQL] Add SupportsScanMerging to merge equivalent V2 file scans [spark]

via GitHub Fri, 05 Jun 2026 08:54:13 -0700


peter-toth commented on PR #56264:
URL: https://github.com/apache/spark/pull/56264#issuecomment-4633263179


   Thanks for the thorough writeup, @LuciferYang -- you're right on all the 
specifics, and the `pushedFilters` point is a catch I missed.
   
   Agreed that the relation's `pushedFilters` is the narrow set: only filters 
that were fully pushed *and* removed from the post-scan side 
(`normalizedFiltersWithoutSubquery.filterNot(postScanFilterSet.contains)`). For 
best-effort sources, where the `WHERE` is pushed for row-group pruning *and* 
kept as a post-scan `Filter`, it's empty -- so a merge keyed off it can't see 
the predicate it would need to widen, and the naive version would fold two 
different-`WHERE` scans into one full-table scan and silently drop the pruning. 
No argument there.
   
   Where I'm more optimistic is the conclusion. The predicate isn't lost, it's 
just not on the relation: for best-effort sources it's the post-scan `Filter` 
condition, and once the `(Filter, Filter)` symmetric propagation runs it's 
preserved in the `Project` aliases `MergeSubplans` itself builds 
(`propagatedFilter_N` = the original predicate). So a Spark-side merge can 
recover the per-side predicates from the plan it's already standing on -- the 
`Filter(Project(scan))` frame -- OR-widen them, and re-push through the 
standard `SupportsPushDownFilters` / `SupportsPushDownV2Filters` + 
`pruneColumns` contracts. That's the "thread the post-scan residual down to the 
leaf merge" you described, but sourced from the Spark-created `Project`, so it 
stays generic -- no per-connector `mergeWith`.
   
   You're right about layering too: the legacy `sources.Filter` translation 
(`DataSourceStrategy.translateFilter` / `PushDownUtils`) lives in sql/core, so 
it can't all sit in catalyst. But that only constrains the legacy-filter path 
-- i.e. the built-in file sources. The V2 predicate path 
(`V2ExpressionBuilder`) is catalyst-native, so for V2-predicate connectors the 
generic merge can be pure catalyst, with the file-source piece coming in 
through an injected hook rather than per-source code. So I think it still lands 
as "Spark once," and `hasMergeBlockingPushdown` is something I'd want either 
way.
   
   I don't want to block your PR on a hunch, though. Let me spend a bit of time 
next week sketching the generic version so we can put the two side by side, as 
you suggested -- then we can see whether the strict-vs-relaxed boundary really 
dissolves, and decide together whether the DSv2 file sources are worth their 
own code on top of the generic core, or whether landing this first and 
generalizing later is cleaner after all. I'll report back here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57205][SQL] Add SupportsScanMerging to merge equivalent V2 file scans [spark]

Reply via email to