peter-toth commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1657990533
Hi @beliefer, This is a nice optimization and awsome performance improvement to Q28, but I'm not sure the implementation is done the right way: - You probably know that `MergeScalarSubqueries` does a very similar plan merging in case of scalar subqueries so I feel this PR shouldn't reimplement the logic of `MergeScalarSubqueries.tryMergePlans()` but reuse the already available code and add the necessary improvements to it if needed. The plan merging code can also be extracted to a common place to share between the rules. - I guess probably the main reason why you reimplemented merge logic is to be able to propagate diferent filter conditions from `Filter` nodes up into `Aggregate` nodes as that is required for Q28, but currently that feature is missing from `MergeScalarSubqueries.tryMergePlans()`. IMO that improvement should be added to the existing code because `Q9` could also benefit from that. Please note that I've already tried to add that in SPARK-40193 / https://github.com/apache/spark/pull/37630. Unfortunately I think this feature can be tricky as there might be cases when merging queries can introduce prformance degradation. E.g. such bad case is when we merge `SELECT sum(a) FROM t WHERE p = 1` and `SELECT sum(b) FROM t WHERE p = 2` into `SELECT sum(a) FILTER (p = 1), sum(b) FILTER (p = 2) FROM t WHERE p = 1 OR p = 2` and `p` is a partitioning column as we need to process more, but we don't scan less data in the merged query. To avoid that I used a trick in https://github.com/apache/spark/pull/37630 to peek into the physical plan to check that only pushed data filters differ but partitioning and bucketing filters match. That trick made the implementation a bit complex, but we don't need to stick to that, we could also disable filter propagation during plan merging by default as this PR does. Anyway, my point is that probably the feature should be added to a common place (`MergeScalarSubqueries.tryMergePlans()`). So I would suggest narrowing the cope of SPARK-44571 / this PR to merge joined aggregate queries using the existing merge logic and SPARK-40193 to handle filter propagation during plan merging. I'also happy to revive https://github.com/apache/spark/pull/37630 if there is any intterrest. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
