rdblue commented on issue #24327: [SPARK-27418][SQL] Migrate Parquet to File Data Source V2
URL: https://github.com/apache/spark/pull/24327#issuecomment-498069887

> Spark needs to read the actual physical schema for getting the exact names and data types for pushing down filters

Yes, filters need to be converted for every Parquet file to ensure each filter is evaluated correctly. This is needed so that field-name case matches, because Parquet is case sensitive.

When reporting the filters that were pushed down (`pushedFilters`), this should use filters that can be converted to Parquet filters. It works to pick any data file, convert the filters against it, and report those. These filters are informational, so what this should return is the set of filters that were converted for any data file.

What can affect correctness are the filters returned by `pushFilters` (not `pushedFilters`). That method's documentation says:

> Pushes down filters, and returns filters that need to be evaluated after scanning.

If you don't know which filters will be applied, you're allowed to return all of them and Spark will add a `Filter` on top of the scan. This is the behavior we probably use anyway, because Spark's codegen filter is likely faster than Parquet's record filter.

I don't think there is a problem with the API: if you don't know which filters will be evaluated, return them all from `pushFilters`.
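
To make the `pushFilters` / `pushedFilters` split concrete, here is a minimal sketch of a scan builder that returns every filter as a post-scan filter (so correctness never depends on Parquet's filtering) while reporting only the convertible ones as pushed. It assumes the `SupportsPushDownFilters` mix-in (its package differs between Spark 2.4 and 3.x); the class name and the `couldConvertToParquetPredicate` helper are hypothetical, not the code in this PR.

```scala
import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter

// Hypothetical builder: keeps all filters as post-scan filters while
// reporting only the convertible subset through pushedFilters().
class ExampleParquetScanBuilder extends SupportsPushDownFilters {

  private var allFilters: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    allFilters = filters
    // Return everything as post-scan filters: Spark adds a Filter node on top
    // of the scan, so correctness never depends on Parquet's record filter.
    filters
  }

  override def pushedFilters(): Array[Filter] = {
    // Informational only: the subset that could be converted to Parquet
    // predicates for some data file.
    allFilters.filter(couldConvertToParquetPredicate)
  }

  // Hypothetical helper; a real conversion would consult a data file's
  // physical schema because Parquet field names are case sensitive.
  private def couldConvertToParquetPredicate(f: Filter): Boolean = true

  // Scan construction is omitted in this sketch.
  override def build(): Scan = ???
}
```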
