adriangb opened a new pull request, #15561: URL: https://github.com/apache/datafusion/pull/15561
Needed for #15301 and #15057. Additionally, I think this will make predicate evaluation *slightly* more performant for files with missing columns.

There are actually 3 file schemas going around:

1. The table schema, as returned by `TableProvider`, etc.
2. The `file_schema` passed into `FileScanConfig`, which is **not** the physical file schema; rather, it's the table schema minus the partition columns.
3. The physical file schema.

Currently we build predicates against (2), which means that a predicate may reference columns not found in the actual file. I believe this would result in `null` stats being created on the fly (some minimal work) and pointless evaluation of predicates (some more work). I'm not sure how this stacks up against the extra work of creating the predicates multiple times, which also has a cost. But that should be easier to cache and is O(number of files) instead of O(number of row pages), so I think it should be better. At the very least this is *more correct* in my mind.
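
To make the distinction between the three schemas concrete, here is a minimal sketch using only the standard library. It is an illustration under simplifying assumptions, not DataFusion code: a schema is modelled as a plain list of column names, and `schema`, `file_schema`, and `missing_in_file` are hypothetical helpers invented for this example. The column names and the "older file" scenario are also made up.

```rust
// Hypothetical, simplified model: a schema is just an ordered list of column names.
// None of these helpers are DataFusion APIs.
type Schema = Vec<String>;

/// Build a schema from string literals (illustration only).
fn schema(cols: &[&str]) -> Schema {
    cols.iter().map(|c| c.to_string()).collect()
}

/// Schema (2): the table schema with the partition columns removed.
fn file_schema(table: &Schema, partition_cols: &[&str]) -> Schema {
    table
        .iter()
        .filter(|c| !partition_cols.contains(&c.as_str()))
        .cloned()
        .collect()
}

/// Columns a predicate references that are absent from the physical file schema (3).
/// This check runs once per file, not once per row page.
fn missing_in_file(predicate_cols: &[&str], physical: &Schema) -> Vec<String> {
    predicate_cols
        .iter()
        .filter(|c| !physical.iter().any(|p| p.as_str() == **c))
        .map(|c| c.to_string())
        .collect()
}

fn main() {
    // (1) Table schema, including the partition column `year`.
    let table = schema(&["a", "b", "c", "year"]);
    // (2) What the scan config sees: table schema minus partition columns.
    let scan = file_schema(&table, &["year"]);
    // (3) Assumed physical schema of one particular file that lacks column `c`.
    let physical = schema(&["a", "b"]);

    // A predicate over `a` and `c` is valid against (2) but references
    // a column that this file does not contain.
    let missing = missing_in_file(&["a", "c"], &physical);
    println!("scan schema: {:?}", scan);
    println!("missing in this file: {:?}", missing);
}
```

In this toy model, a predicate built against (2) still mentions `c` for every file, while one built against (3) can account for the missing column once, up front, for each file.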
