[PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

via GitHub Thu, 03 Apr 2025 07:36:24 -0700


adriangb opened a new pull request, #15561:
URL: https://github.com/apache/datafusion/pull/15561


   Needed for #15301 and #15057.
   
   Additionally I think this will make predicate evaluation *slightly* more 
performant for files with missing columns.
   There are actually 3 file schemas going around:
   1. The table schema, as returned by `TableProvider`, etc.
   2. The `file_schema` passed into `FileScanConfig`, which is **not** the 
physical file schema; rather, it is the table schema minus the partition columns.
   3. The physical file schema.
   
   Currently we build predicates against (2), which means that a predicate may 
reference columns not found in the actual file. I believe this would result in 
`null` stats being created on the fly (some minimal work) and pointless 
evaluation of predicates (some more work).
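
   To make the distinction between the schemas concrete, here is a minimal, 
self-contained sketch. It does not use DataFusion's API: `Schema`, 
`file_schema` (as a function) and `missing_columns` are hypothetical names 
made up for illustration, and a schema is modeled as a plain list of column 
names.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a schema: just an ordered list of column names.
type Schema = Vec<&'static str>;

/// Schema (2): the table schema with the partition columns removed.
fn file_schema(table_schema: &Schema, partition_cols: &[&str]) -> Schema {
    table_schema
        .iter()
        .copied()
        .filter(|c| !partition_cols.contains(c))
        .collect()
}

/// Columns referenced by a predicate that are absent from the given schema.
fn missing_columns(
    predicate_cols: &[&'static str],
    schema: &Schema,
) -> BTreeSet<&'static str> {
    predicate_cols
        .iter()
        .copied()
        .filter(|c| !schema.contains(c))
        .collect()
}
```

   With a table schema of `a, b, date` partitioned on `date`, schema (2) is 
`a, b`. A predicate over `a` and `b` has no missing columns when checked 
against (2), but if a particular file physically contains only `a`, checking 
against (3) reports `b` as missing. That is the case this PR is about.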
   I'm not sure how this compares with the extra work of creating the 
predicates multiple times, which also has a cost. But that work should be 
easier to cache and is O(number of files) instead of O(number of row pages), 
so I think it should be better.
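
   One possible way to cache that work, sketched under assumptions: build the 
predicate once per distinct physical schema and reuse it for every file that 
shares that schema. `PredicateCache` and `get_or_build` are hypothetical 
names, not part of DataFusion, and the "predicate" here is just a placeholder 
string.

```rust
use std::collections::HashMap;

/// Hypothetical cache: one built predicate per distinct physical schema.
struct PredicateCache {
    built: HashMap<Vec<String>, String>,
    /// Number of times a predicate actually had to be built.
    builds: usize,
}

impl PredicateCache {
    fn new() -> Self {
        Self { built: HashMap::new(), builds: 0 }
    }

    /// Returns the predicate for this physical schema, building it only the
    /// first time the schema is seen.
    fn get_or_build(&mut self, physical_schema: &[&str]) -> &str {
        let key: Vec<String> =
            physical_schema.iter().map(|c| c.to_string()).collect();
        if !self.built.contains_key(&key) {
            self.builds += 1;
            // Placeholder for the real (more expensive) predicate construction.
            let predicate = format!("predicate over [{}]", key.join(", "));
            self.built.insert(key.clone(), predicate);
        }
        &self.built[&key]
    }
}
```

   In this sketch the number of builds is bounded by the number of distinct 
physical schemas, which is at most the number of files.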
   
   At the very least this is *more correct* in my mind.

