huaxingao commented on pull request #33650: URL: https://github.com/apache/spark/pull/33650#issuecomment-903856178
I agree with adding `SupportsPushDownCatalystFilters` for pushing down Catalyst `Expression` filters. It seems to me that `pushFilters` is better suited to SQL-based data sources, for two reasons:

1. `pushFilters` returns the filters that need to be evaluated after scanning. Only a SQL-based data source needs to return a separate set of post-scan filters. In a file source, we need to re-evaluate all the filters.
2. `pushFilters` pushes `sources.Filter`. A SQL-based data source only needs `sources.Filter`, and it currently keeps one copy of the filters in that format. File sources, however, currently keep two copies: one in the `sources.Filter` format, pushed down in `pushFilters`, and another in the `Expression` format, pushed down in `PruneFileSourcePartitions`. It seems more reasonable to push down once and maintain one copy. We have to push down in the `Expression` format because the `Expression` is used for partition pruning.

We are not changing `pushFilters`, though. Users who implement `pushFilters` in their file source can keep using it as they do today. I guess this will not break any current applications?
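To make the "push down once, keep one copy" idea concrete, here is a minimal sketch in plain Java. Everything in it is a hypothetical stand-in rather than Spark code: `Expr` only mimics a Catalyst `Expression` by recording the column it references, the interface signature is an assumed shape for the proposed `SupportsPushDownCatalystFilters`, and `FileScanBuilder` is an illustrative class, not Spark's own builder. It only shows how a single pushed list could serve both partition pruning and the post-scan re-evaluation described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// All types below are hypothetical stand-ins, not Spark's real classes.
class PushDownSketch {

    // Stand-in for a Catalyst Expression filter: records the referenced column
    // and a human-readable form of the predicate.
    static final class Expr {
        final String column;
        final String sql;

        Expr(String column, String sql) {
            this.column = column;
            this.sql = sql;
        }
    }

    // Assumed shape of the proposed mix-in: filters are pushed once, as expressions.
    interface SupportsPushDownCatalystFilters {
        // Returns the filters that must still be evaluated after the scan.
        List<Expr> pushFilters(List<Expr> filters);
    }

    // Illustrative file-source builder that keeps a single copy of the pushed filters.
    static final class FileScanBuilder implements SupportsPushDownCatalystFilters {
        private final Set<String> partitionColumns;
        final List<Expr> partitionFilters = new ArrayList<>();
        final List<Expr> dataFilters = new ArrayList<>();

        FileScanBuilder(Set<String> partitionColumns) {
            this.partitionColumns = partitionColumns;
        }

        @Override
        public List<Expr> pushFilters(List<Expr> filters) {
            for (Expr f : filters) {
                if (partitionColumns.contains(f.column)) {
                    partitionFilters.add(f); // used for partition pruning
                } else {
                    dataFilters.add(f);      // candidates for reader-level pushdown
                }
            }
            // Following the comment: a file source re-evaluates every filter post scan.
            return new ArrayList<>(filters);
        }
    }

    public static void main(String[] args) {
        FileScanBuilder builder = new FileScanBuilder(Set.of("dt"));
        List<Expr> postScan = builder.pushFilters(List.of(
            new Expr("dt", "dt = '2021-08-23'"),
            new Expr("id", "id > 10")));
        System.out.println(builder.partitionFilters.size() + " partition, "
            + builder.dataFilters.size() + " data, "
            + postScan.size() + " post-scan");
    }
}
```

A SQL-based source implementing the same interface would instead return only the filters it could not translate, which is the asymmetry point 1 describes.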
