adriangb opened a new issue, #19337: URL: https://github.com/apache/datafusion/issues/19337
From [discord conversation](https://discord.com/channels/885562378132000778/1166447479609376850/1450038250549805086).

The idea is that since we now dynamically adapt predicates/filters on a per-file basis, including adding/removing casts, etc., we could dynamically narrow the *logical* types we encode in Parquet at write time. For example, if we are writing a column that is `Int64` in the schema but the min/max values we see for this particular file are `(2, 19)`, we can add a `UINT_16` annotation to signal to readers that this column is a `UInt16` column. I'm not sure, but I'd guess this can be done after writing the data since it's a metadata-only operation.

Then when a reader sees a predicate like `col > 5` it can cast `5` to `UInt16` and evaluate the whole thing more efficiently. I guess if it were `col > 9999` it could (just from the metadata, without even looking at stats) replace that with `false`?

A next step would be to narrow the physical representation of the data to save storage costs, etc., but I think that would be more involved since it would require two passes over the data or something like that.

cc @asubiotto
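To make the idea concrete, here is a minimal sketch in plain Rust (not DataFusion or parquet-rs API; the names `narrow_logical_type` and `simplify_gt` are hypothetical): the writer side picks the narrowest unsigned annotation that covers a column chunk's observed min/max, and the reader side uses only that annotation to cast or short-circuit a `col > literal` predicate. Note that `col > 9999` collapses to `false` from the annotation alone only when the annotated type's maximum is below the literal (e.g. a `UInt8` annotation); with a `UInt16` annotation the reader would still need the row-group stats.

```rust
// Hypothetical sketch (not DataFusion/parquet-rs API): derive the narrowest
// unsigned annotation from observed min/max, then use only that annotation
// to simplify `col > literal` on the reader side.

#[derive(Debug, PartialEq)]
enum Narrowed {
    UInt8,
    UInt16,
    UInt32,
    /// No narrowing possible; keep the declared Int64 type.
    Int64,
}

/// Writer side: choose an annotation from the (min, max) statistics gathered
/// while writing the column chunk (a metadata-only decision).
fn narrow_logical_type(min: i64, max: i64) -> Narrowed {
    if min < 0 {
        return Narrowed::Int64; // unsigned annotations require non-negative values
    }
    if max <= u8::MAX as i64 {
        Narrowed::UInt8
    } else if max <= u16::MAX as i64 {
        Narrowed::UInt16
    } else if max <= u32::MAX as i64 {
        Narrowed::UInt32
    } else {
        Narrowed::Int64
    }
}

/// Reader side: what `col > literal` simplifies to, using the annotation alone.
#[derive(Debug, PartialEq)]
enum Simplified {
    /// No value of the annotated type can exceed the literal.
    AlwaysFalse,
    /// Every value of the annotated type exceeds the (negative) literal.
    AlwaysTrue,
    /// Evaluate the comparison with the literal cast to the narrowed type.
    CastLiteral(u64),
    /// Fall back to the original Int64 comparison.
    Unchanged,
}

fn simplify_gt(annotation: &Narrowed, literal: i64) -> Simplified {
    let type_max = match annotation {
        Narrowed::UInt8 => u8::MAX as i64,
        Narrowed::UInt16 => u16::MAX as i64,
        Narrowed::UInt32 => u32::MAX as i64,
        Narrowed::Int64 => return Simplified::Unchanged,
    };
    if literal < 0 {
        Simplified::AlwaysTrue // all unsigned values exceed a negative literal
    } else if literal >= type_max {
        Simplified::AlwaysFalse
    } else {
        Simplified::CastLiteral(literal as u64)
    }
}

fn main() {
    // Writer: min/max of (2, 19) fits comfortably in UInt8.
    assert_eq!(narrow_logical_type(2, 19), Narrowed::UInt8);

    // Reader: `col > 5` becomes a narrowed comparison; `col > 9999` against a
    // UInt8-annotated column is false from the annotation alone.
    assert_eq!(simplify_gt(&Narrowed::UInt8, 5), Simplified::CastLiteral(5));
    assert_eq!(simplify_gt(&Narrowed::UInt8, 9999), Simplified::AlwaysFalse);
}
```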
