peter-toth commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1665208447
> I don't think putting the predicates in a Project helps as the problem is from scan pruning.

Sorry, my first comment about the extra project (https://github.com/apache/spark/pull/42223#discussion_r1283307075) was confusing. I got that argument, and I agreed with you (https://github.com/apache/spark/pull/42223#discussion_r1283361954). But if we decide to merge the queries (based on whatever heuristics), then the extra project can help to avoid evaluating the filters twice in the merged query, and it can actually decrease the amount of data to be shuffled for the aggregation: https://github.com/apache/spark/pull/42223#discussion_r1284057348. IMO the extra project between the filter and the scan won't prevent the merged condition from being pushed down to the scan, so I don't see any drawbacks to it. (See the first sketch below.)

> Given that it's hard to estimate the overlapped scan size, the only idea I can think of is to estimate the cost of the predicate. We only do the merge if the predicate is cheap to evaluate: only contains simple comparison and the expression tree size is small.

That makes sense to me and I don't have a better idea. Although I still think we can and should check whether any filters are pushed down to the scans in the original queries. If there are no pushed-down filters, or the pushed-down filters fully match, then we are safe to merge, as the scans fully overlap. If there are non-matching pushed-down filters, then we can use the suggested "expression is cheap" heuristic. (See the last sketch below.)
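To make the extra-project argument concrete, here is a rough DataFrame-level sketch of the merged shape, not the actual rule's output (the table `t`, its columns, and the predicates are made up, and the relative order of Filter and Project in the optimized plan may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("merge-sketch").getOrCreate()
import spark.implicits._

// A table read by two different aggregate queries with different filters.
val t = Seq((1, 10), (2, 20), (3, 30)).toDF("a", "b")

// Unmerged: two scans of `t`, each with its own pushed-down filter.
val q1 = t.where($"a" > 1).agg(sum($"b"))
val q2 = t.where($"a" < 3).agg(sum($"a"))

// Merged shape: one scan, the disjunction of the filters (still push-down-able),
// and a project that evaluates each predicate exactly once as a boolean column.
// The aggregates reference the booleans instead of re-evaluating the predicates,
// and rows matching neither predicate are dropped before the shuffle.
val merged = t
  .where($"a" > 1 || $"a" < 3)
  .select($"a", $"b", ($"a" > 1).as("p1"), ($"a" < 3).as("p2"))
  .agg(
    sum(when($"p1", $"b")).as("sum_b_q1"),
    sum(when($"p2", $"a")).as("sum_a_q2"))
```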
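And a minimal sketch of the quoted "predicate is cheap" check, assuming a node-count threshold and a whitelist of simple node types (the object name, method name, and `maxNodes` value are illustrative, not from this PR):

```scala
import org.apache.spark.sql.catalyst.expressions._

object CheapPredicate {
  // Assumed threshold; the PR would have to pick a real value.
  private val maxNodes = 10

  // Only simple node types are allowed: attribute/literal leaves, binary
  // comparisons, and plain boolean connectives.
  private def isSimpleNode(e: Expression): Boolean = e match {
    case _: Attribute | _: Literal => true
    case _: BinaryComparison       => true
    case _: And | _: Or | _: Not   => true
    case _ => false
  }

  // Cheap = a small expression tree made only of simple nodes.
  def isCheap(predicate: Expression): Boolean = {
    val nodes = predicate.collect { case e => e }
    nodes.size <= maxNodes && nodes.forall(isSimpleNode)
  }
}
```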

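The pushed-down-filter check I'm suggesting could then look roughly like this (`shouldMerge` and its parameters are hypothetical names, just to pin down the decision procedure; comparing the pushed `Filter` sets by equality is a simplification):

```scala
import org.apache.spark.sql.sources.Filter

// Decide whether a candidate merge is safe, given the filters that were
// pushed down to the two original scans.
def shouldMerge(
    pushedLeft: Seq[Filter],
    pushedRight: Seq[Filter],
    predicateIsCheap: => Boolean): Boolean = {
  // No pushed-down filters on either side, or identical ones, means the two
  // scans read exactly the same data, so merging cannot increase scan size.
  val scansFullyOverlap =
    (pushedLeft.isEmpty && pushedRight.isEmpty) ||
      pushedLeft.toSet == pushedRight.toSet
  // Otherwise fall back to the "expression is cheap" heuristic.
  scansFullyOverlap || predicateIsCheap
}
```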