peter-toth commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1665208447
> I don't think putting the predicates in a Project helps as the problem is from scan pruning.

Sorry, my first comment about the extra project (https://github.com/apache/spark/pull/42223#discussion_r1283307075) was confusing. I got that argument, and I agreed with you (https://github.com/apache/spark/pull/42223#discussion_r1283361954). But if we decide to merge the queries (based on whatever heuristics), then the extra project can help to avoid evaluating the filters twice in the merged query, and it can actually decrease the amount of data to be shuffled for the aggregation: https://github.com/apache/spark/pull/42223#discussion_r1284057348. IMO the extra project between the filter and the scan won't prevent the merged condition from being pushed down to the scan, so I don't see any drawbacks to it. (See the first sketch below.)

> Given that it's hard to estimate the overlapped scan size, the only idea I can think of is to estimate the cost of the predicate. We only do the merge if the predicate is cheap to evaluate: only contains simple comparison and the expression tree size is small.

That makes sense to me and I don't have a better idea. Although I still think we can and should check whether any filters are pushed down to the scans in the original queries. If there are no pushed-down filters, or the pushed-down filters fully match, then we are safe to merge, as the scans fully overlap. If there are non-matching pushed-down filters, then we can use the suggested "expression is cheap" heuristic. (See the last sketch below.)
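To make the extra-project argument concrete, here is a rough DataFrame-level sketch of the merged shape, not the actual rule's output (the table `t`, its columns, and the predicates are made up, and the relative order of Filter and Project in the optimized plan may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("merge-sketch").getOrCreate()
import spark.implicits._

// A table read by two different aggregate queries with different filters.
val t = Seq((1, 10), (2, 20), (3, 30)).toDF("a", "b")

// Unmerged: two scans of `t`, each with its own pushed-down filter.
val q1 = t.where($"a" > 1).agg(sum($"b"))
val q2 = t.where($"a" < 3).agg(sum($"a"))

// Merged shape: one scan, the disjunction of the filters (still push-down-able),
// and a project that evaluates each predicate exactly once as a boolean column.
// The aggregates reference the booleans instead of re-evaluating the predicates,
// and rows matching neither predicate are dropped before the shuffle.
val merged = t
  .where($"a" > 1 || $"a" < 3)
  .select($"a", $"b", ($"a" > 1).as("p1"), ($"a" < 3).as("p2"))
  .agg(
    sum(when($"p1", $"b")).as("sum_b_q1"),
    sum(when($"p2", $"a")).as("sum_a_q2"))
```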
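And a minimal sketch of the quoted "predicate is cheap" check, assuming a node-count threshold and a whitelist of simple node types (the object name, method name, and `maxNodes` value are illustrative, not from this PR):

```scala
import org.apache.spark.sql.catalyst.expressions._

object CheapPredicate {
  // Assumed threshold; the PR would have to pick a real value.
  private val maxNodes = 10

  // Only simple node types are allowed: attribute/literal leaves, binary
  // comparisons, and plain boolean connectives.
  private def isSimpleNode(e: Expression): Boolean = e match {
    case _: Attribute | _: Literal => true
    case _: BinaryComparison       => true
    case _: And | _: Or | _: Not   => true
    case _ => false
  }

  // Cheap = a small expression tree made only of simple nodes.
  def isCheap(predicate: Expression): Boolean = {
    val nodes = predicate.collect { case e => e }
    nodes.size <= maxNodes && nodes.forall(isSimpleNode)
  }
}
```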

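The pushed-down-filter check I'm suggesting could then look roughly like this (`shouldMerge` and its parameters are hypothetical names, just to pin down the decision procedure; comparing the pushed `Filter` sets by equality is a simplification):

```scala
import org.apache.spark.sql.sources.Filter

// Decide whether a candidate merge is safe, given the filters that were
// pushed down to the two original scans.
def shouldMerge(
    pushedLeft: Seq[Filter],
    pushedRight: Seq[Filter],
    predicateIsCheap: => Boolean): Boolean = {
  // No pushed-down filters on either side, or identical ones, means the two
  // scans read exactly the same data, so merging cannot increase scan size.
  val scansFullyOverlap =
    (pushedLeft.isEmpty && pushedRight.isEmpty) ||
      pushedLeft.toSet == pushedRight.toSet
  // Otherwise fall back to the "expression is cheap" heuristic.
  scansFullyOverlap || predicateIsCheap
}
```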