adriangb opened a new issue, #19337:
URL: https://github.com/apache/datafusion/issues/19337

   From [discord 
conversation](https://discord.com/channels/885562378132000778/1166447479609376850/1450038250549805086).
   
   The idea is that, since we now dynamically adapt predicates/filters on a
per-file basis (adding/removing casts, etc.), we could also dynamically narrow,
at write time, the *logical* types we encode in Parquet.
   
   For example, if we are writing a column that is `Int64` in the schema but
the min/max values we see for this particular file are `(2, 19)`, we can add a
`UINT_16` annotation to signal to readers that this column is effectively a
`UInt16` column. I'm not sure, but I'd guess this can be done after writing the
data since it's a metadata-only operation. Then when a reader sees a predicate
like `col > 5` it can cast `5` to `UInt16` and evaluate the whole thing more
efficiently. And I guess if the predicate were `col > 99999` (beyond the
`UInt16` range) the reader could replace it with `false` just from the
metadata, without even looking at stats?
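   For illustration, here is a rough sketch of both halves of that idea; the
helper names are hypothetical and not existing DataFusion or parquet APIs. In
the Rust parquet crate the annotation itself would map to something like
`LogicalType::Integer { bit_width, is_signed: false }` on the column's schema
element.

```rust
/// Hypothetical writer-side helper: choose the narrowest unsigned bit width
/// that covers the min/max values observed while writing a column chunk.
/// Returns `None` if no narrower unsigned type fits (e.g. the minimum is
/// negative).
fn narrowest_unsigned_bit_width(min: i64, max: i64) -> Option<u8> {
    if min < 0 {
        return None;
    }
    [(8u8, u8::MAX as i64), (16, u16::MAX as i64), (32, u32::MAX as i64)]
        .into_iter()
        .find(|(_, upper)| max <= *upper)
        .map(|(width, _)| width)
}

/// Hypothetical reader-side helper: decide whether `col > literal` can be
/// folded to `false` purely from the annotated unsigned bit width, without
/// looking at row group or page stats.
fn gt_is_always_false(literal: i64, annotated_bit_width: u8) -> bool {
    let type_max = match annotated_bit_width {
        8 => u8::MAX as i64,
        16 => u16::MAX as i64,
        32 => u32::MAX as i64,
        _ => return false,
    };
    literal >= type_max
}

fn main() {
    // Min/max of (2, 19): the Int64 column could be annotated as UInt8/UInt16.
    assert_eq!(narrowest_unsigned_bit_width(2, 19), Some(8));
    // `col > 99999` can never be true for a UInt16-annotated column.
    assert!(gt_is_always_false(99_999, 16));
    // `col > 5` still needs evaluating, ideally as a UInt16 comparison.
    assert!(!gt_is_always_false(5, 16));
}
```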
   
   A next step would be to narrow the physical representation of the data to
save storage costs, etc., but I think that would be more involved since it
would require two passes over the data or something like that.
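   Just to sketch what that physical narrowing might look like once the writer
knows the range (only an illustration using arrow's `cast` kernel, not a
proposed implementation):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, UInt16Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // The column is Int64 in the schema, but every value in this particular
    // file fits in a UInt16.
    let col: ArrayRef = Arc::new(Int64Array::from(vec![2_i64, 7, 19]));

    // Downcast the physical representation before encoding. This is the part
    // that needs the writer to know the min/max up front, hence the second
    // pass over the data mentioned above.
    let narrowed = cast(&col, &DataType::UInt16)?;
    let narrowed = narrowed.as_any().downcast_ref::<UInt16Array>().unwrap();
    assert_eq!(narrowed.values().to_vec(), vec![2_u16, 7, 19]);
    Ok(())
}
```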
   
   cc @asubiotto 

