Hi Micah,

We were just discussing in the Arrow repo how useful it would be to have utilities that could accurately estimate the deserialized size of a Parquet file. [1] So I would be very supportive of this.
IIUC, the implementation should be trivial for many fixed-size types, although there may be cases that are more complex to track. I'd definitely be interested to hear from folks who have implemented the other size fields how difficult a field like this would be to implement.

Best,

Will Jones

[1] https://github.com/apache/arrow/issues/34712

On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield <[email protected]> wrote:

> Parquet metadata currently tracks uncompressed and compressed page/column
> sizes [1][2]. Uncompressed size here corresponds to encoded size, which can
> differ substantially from the plain-encoded size due to RLE/dictionary
> encoding.
>
> When doing query planning/execution it can be useful to understand the
> total raw size in bytes (e.g., when deciding whether to do a broadcast join).
>
> Would people be open to adding an optional field that records the estimated
> (or exact) size of the column if plain encoding had been used?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
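To make the divergence Micah describes concrete (an illustrative calculation, not figures from the thread): consider a column of 1,000,000 INT64 values drawn from only 100 distinct values. Plain encoding needs 1,000,000 × 8 B ≈ 8 MB, while dictionary encoding needs roughly 100 × 8 B for the dictionary plus 1,000,000 × 7-bit indices ≈ 0.88 MB of data, so the existing uncompressed-size field would understate the raw size by nearly 9×.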
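Below is a minimal sketch of how a writer could accumulate such a field as values are written. All names here are hypothetical (this is not the parquet-mr API); the per-type PLAIN encoding details follow the Parquet spec:

```java
/**
 * Hypothetical per-column accumulator a Parquet writer could maintain to
 * populate the proposed "plain-encoded size" field. Names are invented for
 * illustration only.
 */
public final class PlainSizeEstimator {
  private long plainEncodedBytes = 0;

  // Fixed-width physical types are the trivial case Will mentions:
  // size = value count * type width (INT32/FLOAT -> 4, INT64/DOUBLE -> 8,
  // INT96 -> 12, FIXED_LEN_BYTE_ARRAY -> width from the schema).
  public void addFixedWidthValues(long valueCount, int typeWidthBytes) {
    plainEncodedBytes += valueCount * typeWidthBytes;
  }

  // BOOLEAN is plain-encoded as a bit-packed stream: one bit per value,
  // rounded up to whole bytes.
  public void addBooleanValues(long valueCount) {
    plainEncodedBytes += (valueCount + 7) / 8;
  }

  // BYTE_ARRAY is a harder case: PLAIN encoding prefixes each value with a
  // 4-byte little-endian length, so the total depends on the actual data
  // and must be tracked per value (or estimated).
  public void addByteArrayValue(int valueLengthBytes) {
    plainEncodedBytes += 4L + valueLengthBytes;
  }

  public long estimatedPlainEncodedBytes() {
    return plainEncodedBytes;
  }
}
```

For fixed-width columns this can even be computed after the fact from the existing value counts in the metadata; BYTE_ARRAY columns are where per-value tracking at write time (or an estimate) earns its keep.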
