>
> 1. How are primitive types computed? Should we simply compute the raw size
> by assuming the data is plain-encoded?
>     For example, does INT16 use the same bit-width as INT32?
>     What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
> sizeof(int32) for its length?

Yes, my suggestion is the raw size assuming the data is plain encoded.  INT16
has the same size as INT32.  For fixed-width types in general, it is easy to
back out the actual byte size in memory given the number of values stored and
the number of null values.  For BYTE_ARRAY this means we store 4 bytes (the
length prefix) for every non-null value.  For FIXED_SIZE_BYTE_ARRAY, plain
encoding would have 4 bytes for every value, so my suggestion is yes, we add
in the size overhead.  Again, the size overhead can be backed out given the
number of nulls and the number of values.
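
Concretely, something like the following is what I have in mind (a rough
sketch only; names are illustrative, not a spec proposal):

# Rough sketch: raw size of a column's values as if they were plain encoded.
FIXED_WIDTH_BYTES = {
    "INT32": 4,   # INT8/INT16 use the INT32 physical type, so they count as 4
    "INT64": 8,
    "INT96": 12,
    "FLOAT": 4,
    "DOUBLE": 8,
}

def plain_encoded_size(physical_type, non_null_count,
                       total_value_bytes=0, type_length=0):
    if physical_type in FIXED_WIDTH_BYTES:
        return non_null_count * FIXED_WIDTH_BYTES[physical_type]
    if physical_type == "BYTE_ARRAY":
        # 4-byte length prefix per non-null value plus the value bytes.
        return non_null_count * 4 + total_value_bytes
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        # type_length bytes of payload per non-null value; whether to also
        # count a per-value length overhead is discussed above.
        return non_null_count * type_length
    raise ValueError(f"unhandled physical type: {physical_type}")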


> 2. How do we take care of null values? Should we add the size of validity
> bitmap or null buffer to the raw size?

No, I think this can be inferred from metadata, and consumers can calculate
the space they think this would take in their own memory representation.  I'm
open to thoughts here: standardizing on plain encoding seems the easiest for
people to understand and to adapt the estimate to what they actually care
about, but I can see the other side of simplifying the computation for
systems consuming Parquet, so I'm happy to go either way.
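
As an example of what that adaptation could look like for an Arrow-style
layout (illustrative only; the bitmap cost is an assumption about the
consumer's representation, not something Parquet would store):

import math

def arrow_like_memory_estimate(raw_plain_size, num_values, null_count):
    # A one-bit-per-value validity bitmap, only paid when nulls are present;
    # other engines would plug in whatever their own representation costs.
    validity_bitmap_bytes = math.ceil(num_values / 8) if null_count > 0 else 0
    return raw_plain_size + validity_bitmap_bytes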


> 3. What about complex types?
>     Actually only leaf columns have data in the Parquet file. Should we use
> the sum of all sub columns to be the raw size of a nested column?


My preference would be to keep this on leaf columns.  This leads to two
complications:
1.  Accurately estimating the cost of group values (structs).  This should be
reverse-engineerable if the number of records in the page/column chunk is
known (i.e. the size estimates do not account for group values, and the
reader would calculate that cost based on the number of rows).  This might be
a good reason to try to get data page v2 into shape or to back-port the
number of started records into data page v1.

2.  For repeated values, I think it is sufficient for a reasonable estimate
to know the number of started arrays (this includes nested arrays) contained
in a page/column chunk, and we could add a new field to record this
separately (rough sketch below).
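
Purely as an illustration of how a reader might put these together (the
per-element costs and the started-arrays count are hypothetical, not
existing fields):

def nested_size_estimate(leaf_raw_plain_size, num_rows, num_started_arrays,
                         bytes_per_struct=0, bytes_per_list_offset=4):
    # Leaf raw sizes would come from the proposed field; struct and repetition
    # overhead are reconstructed by the reader from row / started-array counts.
    struct_overhead = num_rows * bytes_per_struct
    list_overhead = num_started_arrays * bytes_per_list_offset
    return leaf_raw_plain_size + struct_overhead + list_overhead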

Thoughts?

> 4. Where to store these raw sizes?
>     Add it to the PageHeader? Or should we aggregate it in the
> ColumnChunkMetaData?

I would suggest adding it to both (IIUC we store uncompressed size and
other values in both as well).
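
The expectation would be the same relationship we already have for
uncompressed_size: the column-chunk value is just the sum over its pages
(sketch below; the field name is hypothetical):

# Per-page values roll up into the column-chunk total.
page_raw_plain_sizes = [16_384, 15_872, 9_016]   # example numbers only
chunk_raw_plain_size = sum(page_raw_plain_sizes)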

Thanks,
Micah

On Sat, Mar 25, 2023 at 7:23 AM Gang Wu <[email protected]> wrote:

> +1 for adding the raw size of each column into the Parquet specs.
>
> I used to work around these by adding similar but hidden fields to the file
> formats.
>
> Let me bring some detailed questions to the table.
> 1. How are primitive types computed? Should we simply compute the raw size
> by assuming the data is plain-encoded?
>     For example, does INT16 use the same bit-width as INT32?
>     What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
> sizeof(int32) for its length?
> 2. How do we take care of null values? Should we add the size of validity
> bitmap or null buffer to the raw size?
> 3. What about complex types?
>     Actually only leaf columns have data in the Parquet file. Should we use
> the sum of all sub columns to be the raw size of a nested column?
> 4. Where to store these raw sizes?
>     Add it to the PageHeader? Or should we aggregate it in the
> ColumnChunkMetaData?
>
> Best,
> Gang
>
> On Sat, Mar 25, 2023 at 12:59 AM Will Jones <[email protected]>
> wrote:
>
> > Hi Micah,
> >
> > We were just discussing in the Arrow repo how useful it would be to have
> > utilities that could accurately estimate the deserialized size of a Parquet
> > file. [1] So I would be very supportive of this.
> >
> > IIUC the implementation of this should be trivial for many fixed-size
> > types, although there may be cases that are more complex to track. I'd
> > definitely be interested to hear from folks who have worked on the
> > implementations for the other size fields what the level of difficulty is
> > to implement such a field.
> >
> > Best,
> >
> > Will Jones
> >
> >  [1] https://github.com/apache/arrow/issues/34712
> >
> > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > Parquet metadata currently tracks uncompressed and compressed page/column
> > > sizes [1][2].  Uncompressed size here corresponds to encoded size, which can
> > > differ substantially from the plain encoding size due to RLE/Dictionary
> > > encoding.
> > >
> > > When doing query planning/execution it can be useful to understand the
> > > total raw size of bytes (e.g. whether to do a broadcast join).
> > >
> > > Would people be open to adding an optional field that records the estimated
> > > (or exact) size of the column if plain encoding had been used?
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > > [2]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
