adriangb opened a new issue, #19337:
URL: https://github.com/apache/datafusion/issues/19337

   From [discord 
conversation](https://discord.com/channels/885562378132000778/1166447479609376850/1450038250549805086).
   
   The idea is that, since we now dynamically adapt predicates/filters on a
per-file basis (adding/removing casts, etc.), we could also dynamically narrow,
at write time, the *logical* types we encode in Parquet.
   
   For example, if we are writing a column that is `Int64` in the schema but
the min/max values we see for this particular file are `(2, 19)`, we can add a
`UINT_16` annotation to signal to readers that this column is effectively a
`UInt16` column. I'm not sure, but I'd guess this can be done after writing the
data since it's a metadata-only operation. Then when a reader sees a predicate
like `col > 5` it can cast `5` to `UInt16` and evaluate the whole thing more
efficiently. And I guess if the predicate were `col > 99999` (beyond the
`UInt16` range) the reader could replace it with `false` just from the
metadata, without even looking at stats?
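   For illustration, here is a rough sketch of both halves of that idea; the
helper names are hypothetical and not existing DataFusion or parquet APIs. In
the Rust parquet crate the annotation itself would map to something like
`LogicalType::Integer { bit_width, is_signed: false }` on the column's schema
element.

```rust
/// Hypothetical writer-side helper: choose the narrowest unsigned bit width
/// that covers the min/max values observed while writing a column chunk.
/// Returns `None` if no narrower unsigned type fits (e.g. the minimum is
/// negative).
fn narrowest_unsigned_bit_width(min: i64, max: i64) -> Option<u8> {
    if min < 0 {
        return None;
    }
    [(8u8, u8::MAX as i64), (16, u16::MAX as i64), (32, u32::MAX as i64)]
        .into_iter()
        .find(|(_, upper)| max <= *upper)
        .map(|(width, _)| width)
}

/// Hypothetical reader-side helper: decide whether `col > literal` can be
/// folded to `false` purely from the annotated unsigned bit width, without
/// looking at row group or page stats.
fn gt_is_always_false(literal: i64, annotated_bit_width: u8) -> bool {
    let type_max = match annotated_bit_width {
        8 => u8::MAX as i64,
        16 => u16::MAX as i64,
        32 => u32::MAX as i64,
        _ => return false,
    };
    literal >= type_max
}

fn main() {
    // Min/max of (2, 19): the Int64 column could be annotated as UInt8/UInt16.
    assert_eq!(narrowest_unsigned_bit_width(2, 19), Some(8));
    // `col > 99999` can never be true for a UInt16-annotated column.
    assert!(gt_is_always_false(99_999, 16));
    // `col > 5` still needs evaluating, ideally as a UInt16 comparison.
    assert!(!gt_is_always_false(5, 16));
}
```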
   
   A next step would be to narrow the physical representation of the data to
save storage costs, etc., but I think that would be more involved since it
would require two passes over the data or something like that.
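   Just to sketch what that physical narrowing might look like once the writer
knows the range (only an illustration using arrow's `cast` kernel, not a
proposed implementation):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, UInt16Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // The column is Int64 in the schema, but every value in this particular
    // file fits in a UInt16.
    let col: ArrayRef = Arc::new(Int64Array::from(vec![2_i64, 7, 19]));

    // Downcast the physical representation before encoding. This is the part
    // that needs the writer to know the min/max up front, hence the second
    // pass over the data mentioned above.
    let narrowed = cast(&col, &DataType::UInt16)?;
    let narrowed = narrowed.as_any().downcast_ref::<UInt16Array>().unwrap();
    assert_eq!(narrowed.values().to_vec(), vec![2_u16, 7, 19]);
    Ok(())
}
```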
   
   cc @asubiotto 

