+1 for an uncompressed size for the field. However, it's a bit tricky. I've
implemented a similar size hint in our system; here are some problems I ran into:
1. Null values. In an Arrow Array, a null slot still occupies space, but a
    raw field size cannot account for that.
2. Size of FLBA/ByteArray. It's unclear whether the size should be the sum of
    the variable-length bytes, or that sum plus sizeof(ByteArray) * value-count
    (see the sketch below).
3. Sometimes the Arrow data does not match the Parquet data, e.g. Decimal
    stored as int32 or int64.
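
To make 1 and 2 concrete, here is a rough C++ sketch (not the parquet-cpp
API; the function names are made up for illustration) comparing the
PLAIN-encoded size of a BYTE_ARRAY column with the memory the same column
needs as an Arrow StringArray. Nulls add nothing to the plain size but still
cost an offset slot and a validity bit in Arrow, and for ByteArray you have
to decide whether any per-value overhead should count at all:

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// PLAIN-encoded BYTE_ARRAY size in Parquet: each non-null value is written as
// a 4-byte length followed by the raw bytes; nulls are carried by definition
// levels and contribute nothing here.
int64_t PlainEncodedByteArraySize(
    const std::vector<std::optional<std::string>>& values) {
  int64_t size = 0;
  for (const auto& v : values) {
    if (v.has_value()) {
      size += 4 + static_cast<int64_t>(v->size());  // length prefix + payload
    }
  }
  return size;
}

// Approximate memory for the same column as an Arrow StringArray: every slot,
// null or not, needs a 4-byte offset, plus roughly one validity bit per slot.
int64_t ArrowStringArraySize(
    const std::vector<std::optional<std::string>>& values) {
  int64_t data = 0;
  for (const auto& v : values) {
    if (v.has_value()) data += static_cast<int64_t>(v->size());
  }
  int64_t offsets = 4 * static_cast<int64_t>(values.size() + 1);
  int64_t validity = (static_cast<int64_t>(values.size()) + 7) / 8;
  return data + offsets + validity;
}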
Hope that helps.

Best, Xuwei Fu

On 2023/03/24 16:26:51 Micah Kornfield wrote:
> Parquet metadata currently tracks uncompressed and compressed page/column
> sizes [1][2].  Uncompressed size here corresponds to encoded size, which can
> differ substantially from the plain encoding size due to RLE/Dictionary
> encoding.
>
> When doing query planning/execution it can be useful to understand the
> total raw size of bytes (e.g. whether to do a broadcast join).
>
> Would people be open to adding an optional field that records the estimated
> (or exact) size of the column if plain encoding had been used?
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> [2]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
>
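
For what it's worth, below is a minimal sketch of how a planner might consume
such an optional hint (the names are hypothetical, not from any existing API),
falling back to the existing total_uncompressed_size when the field is absent:

#include <cstdint>
#include <optional>

// Decide whether a column/table is small enough to broadcast. Prefer the
// plain-encoded size hint; the encoded size can be far smaller because of
// RLE/dictionary encoding and would under-estimate the in-memory cost.
bool ShouldBroadcast(std::optional<int64_t> plain_size_hint,
                     int64_t total_uncompressed_size,
                     int64_t broadcast_threshold_bytes) {
  int64_t estimate = plain_size_hint.value_or(total_uncompressed_size);
  return estimate <= broadcast_threshold_bytes;
}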
