Parquet metadata currently tracks uncompressed and compressed page/column
sizes [1][2].  Uncompressed size here means the encoded size, which can
differ substantially from the plain-encoded size because of RLE/dictionary
encoding.

When doing query planning/execution it can be useful to know the total
raw size of the data in bytes (e.g. to decide whether to do a broadcast
join).

Would people be open to adding an optional field that records the estimated
(or exact) size of the column if plain encoding had been used?
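
To make the idea concrete, here is a rough sketch of how a writer might
compute such a value. It is only an illustration: the function name
plain_encoded_size and the string type names are hypothetical, not part of
any Parquet library. It assumes the standard PLAIN layouts (fixed-width
numerics, bit-packed booleans, 4-byte length prefix for BYTE_ARRAY), counts
only non-null values, and ignores repetition/definition levels and per-page
rounding.

```python
import math

# Width in bytes of one PLAIN-encoded value for fixed-width physical types.
FIXED_WIDTH = {"INT32": 4, "INT64": 8, "INT96": 12, "FLOAT": 4, "DOUBLE": 8}


def plain_encoded_size(physical_type, values, type_length=None):
    """Estimate the bytes needed to PLAIN-encode the non-null values.

    Hypothetical helper for illustration; nulls (None) are skipped because
    they are represented by definition levels, not by value bytes.
    """
    present = [v for v in values if v is not None]
    if physical_type in FIXED_WIDTH:
        return FIXED_WIDTH[physical_type] * len(present)
    if physical_type == "BOOLEAN":
        # PLAIN booleans are bit-packed, one bit per value.
        return math.ceil(len(present) / 8)
    if physical_type == "BYTE_ARRAY":
        # Each value is a 4-byte little-endian length followed by the bytes.
        return sum(4 + len(v) for v in present)
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        if type_length is None:
            raise ValueError("type_length is required for FIXED_LEN_BYTE_ARRAY")
        return type_length * len(present)
    raise ValueError(f"unknown physical type: {physical_type}")


# A low-cardinality string column: tiny once dictionary-encoded, but each
# value costs 4 + 2 bytes under PLAIN encoding.
country_codes = [b"US"] * 1000
print(plain_encoded_size("BYTE_ARRAY", country_codes))
```

For a column like this, the dictionary-encoded "uncompressed" size recorded
today would be far smaller than the plain-encoded figure, which is the gap
the proposed field is meant to expose.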

Thanks,
Micah

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
[2]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
