rdblue commented on issue #24327: [SPARK-27418][SQL] Migrate Parquet to File Data Source V2
URL: https://github.com/apache/spark/pull/24327#issuecomment-498069887
 
 
   > Spark needs to read the actual physical schema for getting the exact names 
and data types for pushing down filters
   
   Yes, filters need to be converted for every Parquet file to ensure the 
filter is evaluated correctly. The conversion has to use the file's physical 
schema so the column name case matches, because Parquet is case sensitive.
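   
   To make that per-file step concrete, here is a minimal sketch (the helper name is hypothetical, not the PR's implementation) of resolving a pushed-down column name against a file's physical Parquet schema so the converted predicate uses the exact case stored in that file:
   
   ```scala
   import org.apache.parquet.schema.MessageType
   import scala.collection.JavaConverters._
   
   // Hypothetical helper: find the physical field whose name matches the pushed
   // column name case-insensitively, and return its exact (case-sensitive) name.
   def resolvePhysicalName(schema: MessageType, pushedName: String): Option[String] =
     schema.getFields.asScala
       .map(_.getName)
       .find(_.equalsIgnoreCase(pushedName))
   ```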
   
   When reporting the filters that were pushed down (`pushedFilters`), this 
should return the filters that could be converted to Parquet filters. These 
filters are informational only, so it works to pick any one data file, convert 
the filters against its schema, and report the ones that converted 
successfully.
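   
   A sketch of that reporting step, assuming a per-file conversion helper in the spirit of Spark's internal `ParquetFilters` (the `convertToParquet` signature here is an assumption, not the actual API):
   
   ```scala
   import org.apache.parquet.filter2.predicate.FilterPredicate
   import org.apache.parquet.schema.MessageType
   import org.apache.spark.sql.sources.Filter
   
   // Sketch only: `convertToParquet` stands in for the per-file conversion,
   // returning None when a filter cannot be expressed as a Parquet predicate.
   def reportPushedFilters(
       candidates: Array[Filter],
       sampleFileSchema: MessageType,
       convertToParquet: (MessageType, Filter) => Option[FilterPredicate]): Array[Filter] =
     candidates.filter(f => convertToParquet(sampleFileSchema, f).isDefined)
   ```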
   
   What can affect correctness are the filters that are returned by 
`pushFilters` (not `pushedFilters`). That method's documentation says:
   
   > Pushes down filters, and returns filters that need to be evaluated after 
scanning.
   
   If you don't know which filters will be applied, you're allowed to return 
all of them, and Spark will add a `Filter` on top of the scan. This is 
probably the behavior we use anyway, because Spark's codegen filter is likely 
faster than Parquet's record-level filter.
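   
   As a sketch of that conservative behavior, written against the Spark 3.x `SupportsPushDownFilters` interface (which may differ from the branch under review; the class name and stubbed `build()` are placeholders):
   
   ```scala
   import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownFilters}
   import org.apache.spark.sql.sources.Filter
   
   // Sketch only: keep every filter for possible row-group pruning, but return
   // them all so Spark also evaluates them after the scan.
   class ConservativeScanBuilder extends SupportsPushDownFilters {
     private var pushed: Array[Filter] = Array.empty
   
     override def pushFilters(filters: Array[Filter]): Array[Filter] = {
       pushed = filters // hand everything to the reader
       filters          // and ask Spark to re-evaluate everything above the scan
     }
   
     override def pushedFilters(): Array[Filter] = pushed
   
     override def build(): Scan = ???
   }
   ```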
   
   I don't think there is a problem with the API: if you don't know what 
filters will be evaluated, return them all in `pushFilters`.
