cloud-fan commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-897401206
I'd like to discuss the high-level idea first. Ideally, the process of DS v2 filter pushdown should be:
1. The rule `V2ScanRelationPushDown` pushes down filters to the v2 source.
2. The v2 source looks at the pushed filters and returns a list of filters that Spark should evaluate. These "post scan" filters are either filters that are not supported by this v2 source, or filters the v2 source can't fully apply itself and needs Spark to evaluate again.
3. For the file source, we need Spark to evaluate the data filters, as statistics-based filtering can't filter out 100% of the unneeded data. We don't need Spark to evaluate the partition filters again, because the file source can prune partitions exactly.
4. Then Spark continues query compilation, applies CBO, and finally submits Spark jobs.

I think the key problem we should fix is: file source v2 should not return partition filters as the "post scan" filters in its implementation of `SupportsPushDownFilters.pushFilters`.
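To make the idea concrete, here is a minimal sketch (not the actual file source v2 code) of how a `ScanBuilder` could keep partition filters out of the "post scan" filters. `MyFileScanBuilder`, `partitionColumns`, and the constructor are made-up names for illustration; only the `SupportsPushDownFilters` interface and `Filter.references` are real Spark APIs:

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter

// Hypothetical file-source scan builder, for illustration only.
class MyFileScanBuilder(partitionColumns: Set[String]) extends ScanBuilder
    with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // Treat a filter as a partition filter only if every column it references
    // is a partition column; such filters are applied exactly by partition
    // pruning, so Spark does not need to re-evaluate them.
    val (partitionFilters, dataFilters) =
      filters.partition(_.references.forall(partitionColumns.contains))

    // Everything is pushed to the source.
    pushed = filters

    // Only data filters are returned as "post scan" filters: statistics-based
    // skipping can't remove all unmatched rows, so Spark must evaluate them
    // again on the rows the scan returns.
    dataFilters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = ???
}
```

With this shape, `V2ScanRelationPushDown` would only re-apply the data filters after the scan, matching step 3 above.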
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
