+1 for an uncompressed-size field. However, it's a bit tricky here. I've
implemented a similar size hint in our system; here are some problems I ran into:
1. Null values. In an Arrow array, a null value still occupies space, but a
    raw field size cannot represent that.
2. Size of FLBA/ByteArray. Its size could be either the sum of the
    variable-length value sizes, or that sum + sizeof(ByteArray) * value-count.
3. Sometimes Arrow data is not equal to Parquet data, e.g. a Decimal stored
    as int32 or int64.
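
To make points 1 and 2 concrete, here is a rough Python sketch. All function
names are hypothetical and only for illustration; they are not from any
Parquet or Arrow library. It assumes nulls take no space in the Parquet value
data but still take a slot (plus a validity bit) in a fixed-width Arrow array,
and it treats the per-value ByteArray overhead as a parameter, since which
overhead to count is exactly the open question.

```python
from typing import Optional, Sequence


def byte_array_raw_size(values: Sequence[Optional[bytes]],
                        per_value_overhead: int = 0) -> int:
    """Payload bytes of the non-null values plus a fixed overhead per value.

    per_value_overhead=0 gives the plain sum of value lengths;
    per_value_overhead=4 models a 4-byte length prefix per value.
    """
    non_null = [v for v in values if v is not None]
    return sum(len(v) for v in non_null) + per_value_overhead * len(non_null)


def fixed_width_sizes(value_count: int, null_count: int, width: int):
    """Return (size without null slots, Arrow-style size with null slots).

    The Arrow-style figure counts one slot per value, null or not, plus a
    1-bit-per-value validity bitmap (buffer padding is ignored here).
    """
    without_nulls = (value_count - null_count) * width
    bitmap_bytes = (value_count + 7) // 8
    with_nulls = value_count * width + bitmap_bytes
    return without_nulls, with_nulls


values = [b"abc", None, b"hello"]
payload_only = byte_array_raw_size(values)        # sum of lengths only
with_prefix = byte_array_raw_size(values, 4)      # plus 4 bytes per value
int32_sizes = fixed_width_sizes(value_count=3, null_count=1, width=4)
```

The two ByteArray numbers differ by the overhead times the non-null count, and
the two fixed-width numbers differ by the null slots plus the bitmap, so a
single "raw size" field needs to say which convention it uses.
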
Hope that helps.

Best, Xuwei Fu

On 2023/03/24 16:59:31 Will Jones wrote:
> Hi Micah,
>
> We were just discussing in the Arrow repo how useful it would be to have
> utilities that could accurately estimate the deserialized size of a Parquet
> file. [1] So I would be very supportive of this.
>
> IIUC the implementation of this should be trivial for many fixed-size
> types, although there may be cases that are more complex to track. I'd
> definitely be interested to hear from folks who have worked on the
> implementations for the other size fields what the level of difficulty is
> to implement such a field.
>
> Best,
>
> Will Jones
>
>  [1] https://github.com/apache/arrow/issues/34712
>
> On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield <[email protected]>
> wrote:
>
> > Parquet metadata currently tracks uncompressed and compressed page/column
> > sizes [1][2].  Uncompressed size here corresponds to encoded size which can
> > differ substantially from the plain encoding size due to RLE/Dictionary
> > encoding.
> >
> > When doing query planning/execution it can be useful to understand the
> > total raw size of bytes (e.g. whether to do a broad-cast join).
> >
> > Would people be open to adding an optional field that records the estimated
> > (or exact) size of the column if plain encoding had been used?
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
> >
>
