peter-toth commented on PR #56264: URL: https://github.com/apache/spark/pull/56264#issuecomment-4633263179
Thanks for the thorough writeup, @LuciferYang -- you're right on all the specifics, and the `pushedFilters` point is a catch I missed. Agreed that the relation's `pushedFilters` is the narrow set: only filters that were fully pushed *and* removed from the post-scan side (`normalizedFiltersWithoutSubquery.filterNot(postScanFilterSet.contains)`). For best-effort sources, where the `WHERE` is pushed for row-group pruning *and* kept as a post-scan `Filter`, it's empty -- so a merge keyed off it can't see the predicate it would need to widen, and the naive version would fold two different-`WHERE` scans into one full-table scan and silently drop the pruning. No argument there. Where I'm more optimistic is the conclusion. The predicate isn't lost, it's just not on the relation: for best-effort sources it's the post-scan `Filter` condition, and once the `(Filter, Filter)` symmetric propagation runs it's preserved in the `Project` aliases `MergeSubplans` itself builds (`propagatedFilter_N` = the original predicate). So a Spark-side merge can recover the per-side predicates from the plan it's already standing on -- the `Filter(Project(scan))` frame -- OR-widen them, and re-push through the standard `SupportsPushDownFilters` / `SupportsPushDownV2Filters` + `pruneColumns` contracts. That's the "thread the post-scan residual down to the leaf merge" you described, but sourced from the Spark-created `Project`, so it stays generic -- no per-connector `mergeWith`. You're right about layering too: the legacy `sources.Filter` translation (`DataSourceStrategy.translateFilter` / `PushDownUtils`) lives in sql/core, so it can't all sit in catalyst. But that only constrains the legacy-filter path -- i.e. the built-in file sources. The V2 predicate path (`V2ExpressionBuilder`) is catalyst-native, so for V2-predicate connectors the generic merge can be pure catalyst, with the file-source piece coming in through an injected hook rather than per-source code. So I think it still lands as "Spark once," and `hasMergeBlockingPushdown` is something I'd want either way. I don't want to block your PR on a hunch, though. Let me spend a bit of time next week sketching the generic version so we can put the two side by side, as you suggested -- then we can see whether the strict-vs-relaxed boundary really dissolves, and decide together whether the DSv2 file sources are worth their own code on top of the generic core, or whether landing this first and generalizing later is cleaner after all. I'll report back here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
