FYI, https://github.com/apache/parquet-format/pull/197 captures the proposal
and has gone through a few rounds of review.  A quick summary: it adds the
following optional metadata:
1.  A variable byte count for BYTE_ARRAY types, stored at the column chunk.
2.  A histogram of repetition and definition levels for the column chunk.
3.  A histogram of repetition and definition levels in the page index.

We plan on merging by end of week if there isn't further feedback.
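
For anyone who wants a concrete picture of the three items, here is a rough
sketch of how such values could be derived for a single optional,
non-repeated BYTE_ARRAY column. It is only an illustration and is not taken
from the PR: the helper names (level_histogram, variable_byte_count) and the
sample data are made up, and it assumes the byte count is the sum of the
value lengths with nulls contributing nothing.

```python
from typing import List, Optional, Sequence


def level_histogram(levels: Sequence[int], max_level: int) -> List[int]:
    """Count how often each level in 0..max_level occurs (hypothetical helper)."""
    hist = [0] * (max_level + 1)
    for level in levels:
        hist[level] += 1
    return hist


def variable_byte_count(values: Sequence[Optional[bytes]]) -> int:
    """Sum the lengths of the non-null byte strings (assumed definition)."""
    return sum(len(v) for v in values if v is not None)


# Sample data: an optional (max definition level 1), non-repeated
# (max repetition level 0) BYTE_ARRAY column with one null.
values = [b"foo", None, b"quux", b""]
def_levels = [0 if v is None else 1 for v in values]
rep_levels = [0] * len(values)

def_hist = level_histogram(def_levels, max_level=1)
rep_hist = level_histogram(rep_levels, max_level=0)
byte_count = variable_byte_count(values)

print(def_hist, rep_hist, byte_count)
```

In this sketch the definition-level histogram also gives the null count
(the entry for level 0), which is the kind of information a planner could
combine with the byte count when estimating in-memory size.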



On Mon, Mar 27, 2023 at 2:23 PM wish maple <[email protected]> wrote:

> +1 for an uncompressed size for the field. However, it's a bit tricky here.
> I've implemented a similar size hint in our system; here are some problems
> I met:
> 1. Null values. In an Arrow Array, a null value still occupies some space,
>     but the field's raw size cannot represent that.
> 2. Size of FLBA/ByteArray. Its size should be either variable-size-summary
>     or variable-size-summary + sizeof(ByteArray) * value-count.
> 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
>     as int32 or int64.
> Hope that helps.
>
> Best, Xuwei Fu
>
> On 2023/03/24 16:26:51 Micah Kornfield wrote:
> > Parquet metadata currently tracks uncompressed and compressed page/column
> > sizes [1][2].  Uncompressed size here corresponds to encoded size which
> > can differ substantially from the plain encoding size due to
> > RLE/Dictionary encoding.
> >
> > When doing query planning/execution it can be useful to understand the
> > total raw size in bytes (e.g. to decide whether to do a broadcast join).
> >
> > Would people be open to adding an optional field that records the
> > estimated (or exact) size of the column if plain encoding had been used?
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
> >
>
