>
> 1. Null values. In an Arrow Array, a null value should occupy some space, but
>     the raw field size cannot represent that.

This is a good point.  The number of nulls can be inferred from statistics
or is included in data page v2 [1].  I'd rather not bake in assumptions
about the size of nulls, as different systems can represent them
differently, and I would prefer to keep this memory-representation
agnostic.  I'm open to thoughts here.
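
To make that concrete, here is a minimal, hypothetical sketch (none of
these names come from Parquet or Arrow APIs) of how a consumer could
layer its own null cost on top of the proposed field, reading the null
count from statistics or the data page v2 header:

def estimated_in_memory_size(plain_size: int,
                             null_count: int,
                             per_null_bytes: int) -> int:
    # plain_size: the proposed field, covering plain-encoded non-null
    #   values only.
    # null_count: from column statistics or the data page v2 header.
    # per_null_bytes: whatever the consumer's representation charges per
    #   null slot (e.g. a validity bit plus a fixed-width slot in Arrow).
    return plain_size + null_count * per_null_bytes

The point is that the format would not need to take a position on what a
null costs; each system plugs in its own per-null figure.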


> 2. Size of FLBA/ByteArray. Its size should be either the variable-size summary
>     or the variable-size summary + sizeof(ByteArray) * value-count.

My suggestion here is the size of the plain-encoded values, because that
encoding is already well defined in Parquet.  I think for FLBA this ends
up being equal to [the column's fixed length * number of non-null
values]; for byte array, the formula listed here is what plain encoding
would work out to, where sizeof(ByteArray) = 4 (the length prefix).  I
think this is preferable, but let me know if this doesn't cover your use
case.
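
As a sketch of the arithmetic (helper names are hypothetical; the
constants follow the plain-encoding rules in the spec, where FLBA values
are packed back to back and BYTE_ARRAY values each carry a 4-byte length
prefix):

def plain_size_flba(type_length: int, non_null_count: int) -> int:
    # FLBA plain encoding is just the fixed-width bytes, back to back.
    return type_length * non_null_count

def plain_size_byte_array(total_value_bytes: int,
                          non_null_count: int) -> int:
    # BYTE_ARRAY plain encoding prefixes each value with a 4-byte
    # little-endian length, hence sizeof(ByteArray) = 4 per value.
    return total_value_bytes + 4 * non_null_count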


> 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
>     as int32 or int64.
> Hope that helps.

Yes, my intent would be to keep this agnostic from other systems, but I
think the information allows other systems to use the estimate reasonably
well or to back out their own computation.  The size of the Decimal
values in Arrow can be determined from the precision and scale of the
column, the chosen Arrow decimal width, and the number of values.
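
For example (a hypothetical sketch, assuming the reader materializes
decimals as Arrow decimal128 or decimal256, whose storage widths are
fixed at 16 and 32 bytes regardless of whether Parquet stored the values
as int32, int64, or FLBA):

def arrow_decimal_size(precision: int, value_count: int) -> int:
    # Precision picks the Arrow width; scale does not affect storage.
    width = 16 if precision <= 38 else 32  # decimal128 vs. decimal256
    return width * value_count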

Best, Xuwei Fu

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L563

On Sat, Mar 25, 2023 at 9:36 AM wish maple <[email protected]> wrote:

> +1 for an uncompressed size for the field. However, it's a bit tricky here.
> I've implemented a similar size hint in our system; here are some problems
> I ran into:
> 1. Null values. In an Arrow Array, a null value should occupy some space, but
>     the raw field size cannot represent that.
> 2. Size of FLBA/ByteArray. Its size should be either the variable-size summary
>     or the variable-size summary + sizeof(ByteArray) * value-count.
> 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
>     as int32 or int64.
> Hope that helps.
>
> Best, Xuwei Fu
>
> On 2023/03/24 16:59:31 Will Jones wrote:
> > Hi Micah,
> >
> > We were just discussing in the Arrow repo how useful it would be to have
> > utilities that could accurately estimate the deserialized size of a
> Parquet
> > file. [1] So I would be very supportive of this.
> >
> > IIUC the implementation of this should be trivial for many fixed-size
> > types, although there may be cases that are more complex to track. I'd
> > definitely be interested to hear from folks who have worked on the
> > implementations for the other size fields what the level of difficulty is
> > to implement such a field.
> >
> > Best,
> >
> > Will Jones
> >
> >  [1] https://github.com/apache/arrow/issues/34712
> >
> > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > Parquet metadata currently tracks uncompressed and compressed
> > > page/column sizes [1][2].  Uncompressed size here corresponds to
> > > encoded size, which can differ substantially from the plain-encoding
> > > size due to RLE/dictionary encoding.
> > >
> > > When doing query planning/execution, it can be useful to understand
> > > the total raw size of bytes (e.g. whether to do a broadcast join).
> > >
> > > Would people be open to adding an optional field that records the
> > > estimated (or exact) size of the column if plain encoding had been
> > > used?
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > > [2]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
