cloud-fan commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-897401206
I'd like to discuss the high-level idea first. Ideally, the process of DS v2 filter pushdown should be:
1. The rule `V2ScanRelationPushDown` pushes down filters to the v2 source.
2. The v2 source looks at the pushed filters and returns a list of filters that Spark should evaluate. These "post scan" filters are either filters that are not supported by this v2 source, or filters the v2 source can't fully apply itself and needs Spark to evaluate again.
3. For the file source, we need Spark to evaluate the data filters, as statistics-based filtering can't filter out 100% of the unneeded data. We don't need Spark to evaluate the partition filters again, because the file source can prune partitions exactly.
4. Then Spark continues query compilation, applies CBO, and finally submits Spark jobs.

I think the key problem we should fix is: file source v2 should not return partition filters as the "post scan" filters in its implementation of `SupportsPushDownFilters.pushFilters`.
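To make the idea concrete, here is a minimal sketch (not the actual file source v2 code) of how a `ScanBuilder` could keep partition filters out of the "post scan" filters. `MyFileScanBuilder`, `partitionColumns`, and the constructor are made-up names for illustration; only the `SupportsPushDownFilters` interface and `Filter.references` are real Spark APIs:

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter

// Hypothetical file-source scan builder, for illustration only.
class MyFileScanBuilder(partitionColumns: Set[String]) extends ScanBuilder
    with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // Treat a filter as a partition filter only if every column it references
    // is a partition column; such filters are applied exactly by partition
    // pruning, so Spark does not need to re-evaluate them.
    val (partitionFilters, dataFilters) =
      filters.partition(_.references.forall(partitionColumns.contains))

    // Everything is pushed to the source.
    pushed = filters

    // Only data filters are returned as "post scan" filters: statistics-based
    // skipping can't remove all unmatched rows, so Spark must evaluate them
    // again on the rows the scan returns.
    dataFilters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = ???
}
```

With this shape, `V2ScanRelationPushDown` would only re-apply the data filters after the scan, matching step 3 above.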
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
