[DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

Micah Kornfield Fri, 24 Mar 2023 09:27:51 -0700

Parquet metadata currently tracks uncompressed and compressed page/column
sizes [1][2].  The uncompressed size here is the encoded size, which can
differ substantially from the plain encoding size because of RLE/dictionary
encoding.


When doing query planning/execution, it can be useful to know the total raw
size of the data in bytes (e.g. to decide whether to do a broadcast join).

Would people be open to adding an optional field that records the estimated
(or exact) size of the column if plain encoding had been used?
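
As a rough illustration of what such a field would record, here is a minimal
sketch that estimates the plain-encoded size of one column's values. The
function name `plain_encoded_size` and its parameters are hypothetical and not
part of any Parquet library; the per-type widths follow the PLAIN encoding
rules (fixed widths for numeric types, a 4-byte length prefix per BYTE_ARRAY
value). Nulls are assumed to contribute no value bytes, and booleans are
rounded up to whole bytes once for the whole column rather than per page, so
the result is an estimate.

```python
import math

# Bytes per value under PLAIN encoding for fixed-width physical types.
FIXED_WIDTH = {"INT32": 4, "INT64": 8, "INT96": 12, "FLOAT": 4, "DOUBLE": 8}


def plain_encoded_size(physical_type, values, type_length=None):
    """Estimate the bytes needed to PLAIN-encode the non-null values.

    Hypothetical helper for illustration only. Repetition and definition
    levels are not counted.
    """
    # Assumption: nulls are carried by definition levels, not by value bytes.
    present = [v for v in values if v is not None]

    if physical_type == "BOOLEAN":
        # Booleans are bit-packed, one bit per value.
        return math.ceil(len(present) / 8)
    if physical_type == "BYTE_ARRAY":
        # Each value is a 4-byte little-endian length followed by the bytes.
        return sum(4 + len(v) for v in present)
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        if type_length is None:
            raise ValueError("type_length is required for FIXED_LEN_BYTE_ARRAY")
        return type_length * len(present)
    return FIXED_WIDTH[physical_type] * len(present)


# A low-cardinality string column: small on disk with dictionary encoding,
# but 6 bytes per value once materialized as plain values.
country_codes = [b"US"] * 1000
estimated_bytes = plain_encoded_size("BYTE_ARRAY", country_codes)
```

For the column above, the estimate is 1000 * (4 + 2) = 6000 bytes, whereas a
dictionary-encoded page would store the two-byte string once plus compact
indices. That gap is the information a planner cannot recover from the
existing size fields.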

Thanks,
Micah

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
[2]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
