Hi Micah,

We were just discussing in the Arrow repo how useful it would be to have
utilities that could accurately estimate the deserialized size of a Parquet
file. [1] So I would be very supportive of this.

IIUC the implementation of this should be trivial for many fixed-size
types, although there may be cases that are more complex to track. I'd
definitely be interested to hear from folks who have worked on the
implementations of the other size fields how difficult such a field
would be to implement.
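For the fixed-width physical types, the arithmetic does seem straightforward: something like the sketch below (the function name and shape here are purely illustrative, not from any existing implementation). BYTE_ARRAY is where it gets harder, since its plain-encoded size depends on the individual value lengths rather than just the value count.

```python
import math
from typing import Optional

# Illustrative only: plain-encoded widths for Parquet's fixed-width
# physical types. Booleans are bit-packed in plain encoding, hence 1/8.
PLAIN_WIDTH_BYTES = {
    "BOOLEAN": 1 / 8,
    "INT32": 4,
    "INT64": 8,
    "INT96": 12,
    "FLOAT": 4,
    "DOUBLE": 8,
}

def estimate_plain_size(physical_type: str, num_values: int,
                        type_length: Optional[int] = None) -> Optional[int]:
    """Estimated plain-encoded byte size of a column chunk, or None
    when metadata alone is not enough (e.g. BYTE_ARRAY, whose plain
    size depends on the actual value lengths)."""
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        return None if type_length is None else num_values * type_length
    width = PLAIN_WIDTH_BYTES.get(physical_type)
    if width is None:
        return None
    return math.ceil(num_values * width)

# 1,000,000 INT64 values come out to 8,000,000 bytes regardless of
# how well they dictionary- or RLE-encode.
```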

Best,

Will Jones

 [1] https://github.com/apache/arrow/issues/34712

On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield <[email protected]>
wrote:

> Parquet metadata currently tracks uncompressed and compressed page/column
> sizes [1][2].  Uncompressed size here corresponds to the encoded size, which
> can differ substantially from the plain-encoded size due to RLE/dictionary
> encoding.
>
> When doing query planning/execution it can be useful to understand the
> total raw size in bytes (e.g. when deciding whether to do a broadcast join).
>
> Would people be open to adding an optional field that records the estimated
> (or exact) size of the column if plain encoding had been used?
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> [2]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
>