[GitHub] [spark] peter-toth commented on pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

via GitHub Mon, 31 Jul 2023 02:19:59 -0700


peter-toth commented on PR #42223:
URL: https://github.com/apache/spark/pull/42223#issuecomment-1657990533


   Hi @beliefer,
   This is a nice optimization and awsome performance improvement to Q28, but 
I'm not sure the implementation is done the right way:
   - You probably know that `MergeScalarSubqueries` does a very similar plan 
merging in case of scalar subqueries so I feel this PR shouldn't reimplement 
the logic of `MergeScalarSubqueries.tryMergePlans()` but reuse the already 
available code and add the necessary improvements to it if needed. The plan 
merging code can also be extracted to a common place to share between the rules.
   - I guess probably the main reason why you reimplemented merge logic is to 
be able to propagate diferent filter conditions from `Filter` nodes up into 
`Aggregate` nodes as that is required for Q28, but currently that feature is 
missing from `MergeScalarSubqueries.tryMergePlans()`. IMO that improvement 
should be added to the existing code because `Q9` could also benefit from that. 
Please note that I've already tried to add that in SPARK-40193 / 
https://github.com/apache/spark/pull/37630. 
     Unfortunately I think this feature can be tricky as there might be cases 
when merging queries can introduce prformance degradation. E.g. such bad case 
is when we merge `SELECT sum(a) FROM t WHERE p = 1` and  `SELECT sum(b) FROM t 
WHERE p = 2` into `SELECT sum(a) FILTER (p = 1), sum(b) FILTER (p = 2) FROM t 
WHERE p = 1 OR p = 2` and `p` is a partitioning column as we need to process 
more, but we don't scan less data in the merged query. To avoid that I used a 
trick in https://github.com/apache/spark/pull/37630 to peek into the physical 
plan to check that only pushed data filters differ but partitioning and 
bucketing filters match. That trick made the implementation a bit complex, but 
we don't need to stick to that, we could also disable filter propagation during 
plan merging by default as this PR does.
     Anyway, my point is that probably the feature should be added to a common 
place (`MergeScalarSubqueries.tryMergePlans()`).
   
   So I would suggest narrowing the cope of SPARK-44571 / this PR to merge 
joined aggregate queries using the existing merge logic and SPARK-40193 to 
handle filter propagation during plan merging. I'also happy to revive 
https://github.com/apache/spark/pull/37630 if there is any intterrest.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] peter-toth commented on pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

Reply via email to