+1 for adding the raw size of each column to the Parquet spec.

I used to work around this by adding similar but hidden fields to the file
formats.

Let me bring some detailed questions to the table.
1. How should the raw size of primitive types be computed? Should we simply
compute it by assuming the data is plain-encoded?
    For example, does INT16 use the same bit-width as INT32?
    What about BYTE_ARRAY/FIXED_LEN_BYTE_ARRAY? Should we add an extra
sizeof(int32) per value for its length?
2. How do we take care of null values? Should we add the size of the validity
bitmap or null buffer to the raw size?
3. What about complex types?
    Only leaf columns actually have data in the Parquet file. Should we use
the sum over all sub-columns as the raw size of a nested column?
4. Where should these raw sizes be stored?
    Add them to the PageHeader? Or should we aggregate them in the
ColumnChunkMetaData?
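
To make question 1 concrete, here is a minimal sketch of one possible
definition, assuming "raw size" means the byte count of the values under
Parquet's PLAIN encoding. The function name plain_encoded_size and its
parameters are hypothetical, not part of any Parquet library. It assumes nulls
contribute zero bytes (in Parquet they are carried by definition levels, not by
the value stream), which is only one of the options raised in question 2.

```python
from typing import Optional, Sequence

# Bytes per value for fixed-width physical types under PLAIN encoding.
FIXED_WIDTH_BYTES = {"INT32": 4, "INT64": 8, "INT96": 12, "FLOAT": 4, "DOUBLE": 8}


def plain_encoded_size(
    physical_type: str,
    values: Sequence[Optional[object]],
    type_length: Optional[int] = None,
) -> int:
    """Hypothetical helper: bytes needed to PLAIN-encode the non-null values."""
    # Assumption: nulls are not counted, since PLAIN stores only present values.
    non_null = [v for v in values if v is not None]
    count = len(non_null)

    if physical_type == "BOOLEAN":
        # PLAIN booleans are bit-packed, one bit per value.
        return (count + 7) // 8
    if physical_type in FIXED_WIDTH_BYTES:
        # INT16 is a logical annotation on INT32, so it would also cost 4 bytes.
        return count * FIXED_WIDTH_BYTES[physical_type]
    if physical_type == "BYTE_ARRAY":
        # Each value carries a 4-byte length prefix followed by its bytes.
        return sum(4 + len(v) for v in non_null)
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        if type_length is None:
            raise ValueError("type_length is required for FIXED_LEN_BYTE_ARRAY")
        # No length prefix: the width comes from the schema.
        return count * type_length
    raise ValueError(f"unknown physical type: {physical_type}")
```

Under this definition a nested column's raw size would simply be the sum over
its leaf columns, and whether to add level or validity overhead on top remains
an open choice.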

Best,
Gang

On Sat, Mar 25, 2023 at 12:59 AM Will Jones <[email protected]> wrote:

> Hi Micah,
>
> We were just discussing in the Arrow repo how useful it would be to have
> utilities that could accurately estimate the deserialized size of a Parquet
> file. [1] So I would be very supportive of this.
>
> IIUC the implementation of this should be trivial for many fixed-size
> types, although there may be cases that are more complex to track. I'd
> definitely be interested to hear from folks who have worked on the
> implementations for the other size fields what the level of difficulty is
> to implement such a field.
>
> Best,
>
> Will Jones
>
>  [1] https://github.com/apache/arrow/issues/34712
>
> On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield <[email protected]>
> wrote:
>
> > Parquet metadata currently tracks uncompressed and compressed page/column
> > sizes [1][2].  Uncompressed size here corresponds to encoded size which
> > can differ substantially from the plain encoding size due to
> > RLE/Dictionary encoding.
> >
> > When doing query planning/execution it can be useful to understand the
> > total raw size of bytes (e.g. whether to do a broadcast join).
> >
> > Would people be open to adding an optional field that records the
> > estimated (or exact) size of the column if plain encoding had been used?
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > [2]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
> >
>
