Parquet metadata currently tracks uncompressed and compressed page/column sizes [1][2]. The uncompressed size here corresponds to the encoded size, which can differ substantially from the plain-encoded size due to RLE/dictionary encoding.
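As a rough illustration of what "plain-encoded size" means, here is a minimal sketch that estimates it from the values alone. The function name and signature are hypothetical (they are not part of any Parquet library), and the sketch assumes non-null values only, ignoring definition/repetition levels and page headers.

```python
# Hypothetical helper: estimates the number of bytes needed to PLAIN-encode
# the non-null values of a single column. Not part of any Parquet library.

# Byte widths of the fixed-width physical types under PLAIN encoding.
FIXED_WIDTH_BYTES = {
    "INT32": 4,
    "INT64": 8,
    "INT96": 12,
    "FLOAT": 4,
    "DOUBLE": 8,
}


def plain_encoded_size(values, physical_type, type_length=None):
    """Return the estimated PLAIN-encoded size in bytes of `values`."""
    count = len(values)
    if physical_type in FIXED_WIDTH_BYTES:
        return count * FIXED_WIDTH_BYTES[physical_type]
    if physical_type == "BOOLEAN":
        # Booleans are bit-packed, one bit per value, rounded up to whole bytes.
        return (count + 7) // 8
    if physical_type == "BYTE_ARRAY":
        # Each value is a 4-byte length prefix followed by its bytes.
        return sum(4 + len(v) for v in values)
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        if type_length is None:
            raise ValueError("type_length is required for FIXED_LEN_BYTE_ARRAY")
        return count * type_length
    raise ValueError(f"unknown physical type: {physical_type}")


# A low-cardinality string column: dictionary encoding would store the three
# distinct values once plus small indices, but the plain size grows with
# every row.
country_codes = [b"US", b"DE", b"FR"] * 1000
print(plain_encoded_size(country_codes, "BYTE_ARRAY"))
```

For a column like this, the encoded size recorded in the metadata can be far smaller than the estimate printed here, which is the gap described above.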
When doing query planning/execution, it can be useful to know the total raw size of the data in bytes (e.g. when deciding whether to do a broadcast join). Would people be open to adding an optional field that records the estimated (or exact) size of the column if plain encoding had been used?

Thanks,
Micah

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
