Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
FYI https://github.com/apache/parquet-format/pull/197 captures the proposal and has gone through a few rounds of reviews. A quick summary: this adds the following optional metadata: 1. A variable-length byte count for BYTE_ARRAY types at the column chunk level. 2. A histogram of repetition and definition levels for the column chunk. 3. A histogram of repetition and definition levels in the page index. We plan on merging by end of week if there isn't further feedback. On Mon, Mar 27, 2023 at 2:23 PM wish maple wrote: > +1 for an uncompressed size for the field. However, it's a bit tricky here. > I've > implemented a similar size hint in our system; here are some problems I met: > 1. Null values. In an Arrow Array, a null value should occupy some space, but > the field's raw size cannot represent that value. > 2. Size of FLBA/ByteArray. Its size should be variable-size-summary or > variable-size-summary + sizeof(ByteArray) * value-count. > 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored > as int32 or int64. > Hope that helps. > > Best, Xuwei Fu > > On 2023/03/24 16:26:51 Micah Kornfield wrote: > > Parquet metadata currently tracks uncompressed and compressed page/column > > sizes [1][2]. Uncompressed size here corresponds to encoded size, which > can > > differ substantially from the plain-encoding size due to RLE/dictionary > > encoding. > > > > When doing query planning/execution it can be useful to understand the > > total raw size in bytes (e.g. whether to do a broadcast join). > > > > Would people be open to adding an optional field that records the > estimated > > (or exact) size of the column if plain encoding had been used? > > > > Thanks, > > Micah > > > > [1] > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728 > > [2] > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637 > > >
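As a rough sketch of what these statistics involve (the names below are illustrative, not the PR's actual Thrift definitions), a writer holding the definition/repetition levels and the raw BYTE_ARRAY values could compute them like this:

```python
from collections import Counter

def level_histogram(levels, max_level):
    """Count occurrences of each level, 0..max_level.

    Index i of the result holds how many values sit at level i, mirroring
    the proposed repetition/definition level histograms.
    """
    counts = Counter(levels)
    return [counts.get(i, 0) for i in range(max_level + 1)]

def unencoded_byte_array_bytes(values):
    """Total raw bytes of BYTE_ARRAY data, not counting length prefixes.

    `values` holds bytes objects for non-null entries and None for nulls.
    """
    return sum(len(v) for v in values if v is not None)

# Example: a column with max definition level 1 (one optional level).
def_levels = [1, 0, 1, 1, 0]               # 0 = null, 1 = present
values = [b"abc", None, b"de", b"f", None]
print(level_histogram(def_levels, 1))      # [2, 3]
print(unencoded_byte_array_bytes(values))  # 6
```

Whether the byte count includes the 4-byte length prefixes is a detail settled by the PR itself; the sketch above excludes them.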
Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
I put together a draft PR: https://github.com/apache/parquet-format/pull/197/files Thinking about the nulls and nesting levels a bit more, I think keeping a histogram of repetition and definition levels probably strikes the right balance between simplicity and accuracy, but it would be great to hear if there are other approaches. On Sat, Mar 25, 2023 at 5:37 PM Micah Kornfield wrote: > 2. For repeated values, I think it is sufficient to get a reasonable >> estimate to know the number of array starts (this includes nested arrays) >> contained in a page/column chunk, and we can add a new field to record this >> separately. > > > Apologies for replying to myself, but one more thought: I don't think any > set of simple statistics is going to be able to give accurate memory > estimates for combinations of repeated and nested fields, and the intent of > my proposal is more to get meaningful estimates rather than exact > measures. A more exact approach for deeply nested and repeated fields > could be a repeated struct of {rep-level, number of array starts, total > number of elements contained in arrays at this level}. This would still be > reasonably compact (1 per rep-level) but seems too complex to me. If > people think the complexity isn't there or provides meaningful value, it > seems like a plausible approach; there might be others. > > On Sat, Mar 25, 2023 at 4:59 PM Micah Kornfield > wrote: > >> 1. How are primitive types computed? Should we simply compute the raw size >>> by assuming the data is plain-encoded? >>> For example, does INT16 use the same bit-width as INT32? >>> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra >>> sizeof(int32) for its length? >> >> Yes, my suggestion is the raw size assuming it is plain encoded. INT16 has >> the same size as INT32. In general, for fixed-width types, it is easy to back >> out the actual byte size in memory given the number of values stored and the >> number of null values. For BYTE_ARRAY this means we store 4 bytes for >> every non-null value. For FIXED_SIZE_BYTE_ARRAY plain encoding would have >> 4 bytes for every value, so my suggestion is yes, we add in the size >> overhead. Again, the size overhead can be backed out given the number of nulls and >> the number of values. Given this for >> >> >>> 2. How do we take care of null values? Should we add the size of the validity >>> bitmap or null buffer to the raw size? >> >> No, I think this can be inferred from metadata and consumers can >> calculate the space they think this would take in their memory >> representation. Open to thoughts here, but it seems standardizing on plain >> encoding is the easiest for people to understand and adapt the estimate to >> what they actually care about, but I can see the other side of simplifying >> the computation for systems consuming parquet, so happy to go either way. >> >> >>> 3. What about complex types? >>> Actually only leaf columns have data in the Parquet file. Should we >>> use >>> the sum of all sub columns to be the raw size of a nested column? >> >> >> My preference would be to keep this on leaf columns. This leads to two >> complications: >> 1. Accurately estimating the cost of group values (structs). This should be >> reverse-engineerable if the number of records in the page/column chunk is >> known (i.e. size estimates do not account for group values, and the reader >> would calculate based on the number of rows). This might be a good reason >> to try to get data page v2 into shape or back-port the number of started >> records into data page v1. >> >> 2. For repeated values, I think it is sufficient to get a reasonable >> estimate to know the number of array starts (this includes nested arrays) >> contained in a page/column chunk, and we can add a new field to record this >> separately. >> >> Thoughts? >> >> 4. Where to store these raw sizes? >> >>> Add it to the PageHeader? Or should we aggregate it in the >>> ColumnChunkMetaData? >> >> I would suggest adding it to both (IIUC we store uncompressed size and >> other values in both as well). >> >> Thanks, >> Micah >> >> On Sat, Mar 25, 2023 at 7:23 AM Gang Wu wrote: >> >>> +1 for adding the raw size of each column into the Parquet specs. >>> >>> I used to work around this by adding similar but hidden fields to the >>> file >>> formats. >>> >>> Let me bring some detailed questions to the table. >>> 1. How are primitive types computed? Should we simply compute the raw >>> size >>> by assuming the data is plain-encoded? >>> For example, does INT16 use the same bit-width as INT32? >>> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra >>> sizeof(int32) for its length? >>> 2. How do we take care of null values? Should we add the size of the validity >>> bitmap or null buffer to the raw size? >>> 3. What about complex types? >>> Actually only leaf columns have data in the Parquet file. Should we >>> use >>> the sum of all sub columns to be the raw size of a nested column? >>> 4. Where to
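The "more exact" per-repetition-level statistic floated above (and rejected as too complex) might look something like the following. This is purely an illustrative sketch of the idea as described in the thread, not anything that made it into the draft PR:

```python
from dataclasses import dataclass

@dataclass
class RepLevelStat:
    """One entry per repetition level; a hypothetical, never-adopted design."""
    rep_level: int          # the repetition level this entry describes
    num_array_starts: int   # how many arrays begin at this nesting level
    total_elements: int     # total elements contained in arrays at this level

# For a column value like [[1, 2], [3]] at rep-level 1: two inner arrays
# start, holding 3 elements in total.
stat = RepLevelStat(rep_level=1, num_array_starts=2, total_elements=3)
print(stat.num_array_starts)  # 2
```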
Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
> > 2. For repeated values, I think it is sufficient to get a reasonable > estimate to know the number of array starts (this includes nested arrays) > contained in a page/column chunk and we can add a new field to record this > separately. Apologies for replying to myself, but one more thought: I don't think any set of simple statistics is going to be able to give accurate memory estimates for combinations of repeated and nested fields, and the intent of my proposal is more to get meaningful estimates rather than exact measures. A more exact approach for deeply nested and repeated fields could be a repeated struct of {rep-level, number of array starts, total number of elements contained in arrays at this level}. This would still be reasonably compact (1 per rep-level) but seems too complex to me. If people think the complexity isn't there or provides meaningful value, it seems like a plausible approach; there might be others. On Sat, Mar 25, 2023 at 4:59 PM Micah Kornfield wrote: > 1. How are primitive types computed? Should we simply compute the raw size >> by assuming the data is plain-encoded? >> For example, does INT16 use the same bit-width as INT32? >> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra >> sizeof(int32) for its length? > > Yes, my suggestion is the raw size assuming it is plain encoded. INT16 has the > same size as INT32. In general, for fixed-width types, it is easy to back out > the actual byte size in memory given the number of values stored and the number > of null values. For BYTE_ARRAY this means we store 4 bytes for every > non-null value. For FIXED_SIZE_BYTE_ARRAY plain encoding would have 4 > bytes for every value, so my suggestion is yes, we add in the size overhead. > Again, the size overhead can be backed out given the number of nulls and the number of > values. Given this for > > >> 2. How do we take care of null values? Should we add the size of the validity >> bitmap or null buffer to the raw size? > > No, I think this can be inferred from metadata and consumers can calculate > the space they think this would take in their memory representation. Open > to thoughts here, but it seems standardizing on plain encoding is the > easiest for people to understand and adapt the estimate to what they > actually care about, but I can see the other side of simplifying the > computation for systems consuming parquet, so happy to go either way. > > >> 3. What about complex types? >> Actually only leaf columns have data in the Parquet file. Should we >> use >> the sum of all sub columns to be the raw size of a nested column? > > > My preference would be to keep this on leaf columns. This leads to two > complications: > 1. Accurately estimating the cost of group values (structs). This should be > reverse-engineerable if the number of records in the page/column chunk is > known (i.e. size estimates do not account for group values, and the reader > would calculate based on the number of rows). This might be a good reason > to try to get data page v2 into shape or back-port the number of started > records into data page v1. > > 2. For repeated values, I think it is sufficient to get a reasonable > estimate to know the number of array starts (this includes nested arrays) > contained in a page/column chunk and we can add a new field to record this > separately. > > Thoughts? > > 4. Where to store these raw sizes? > >> Add it to the PageHeader? Or should we aggregate it in the >> ColumnChunkMetaData? > > I would suggest adding it to both (IIUC we store uncompressed size and > other values in both as well). > > Thanks, > Micah > > On Sat, Mar 25, 2023 at 7:23 AM Gang Wu wrote: > >> +1 for adding the raw size of each column into the Parquet specs. >> >> I used to work around this by adding similar but hidden fields to the >> file >> formats. >> >> Let me bring some detailed questions to the table. >> 1. How are primitive types computed? Should we simply compute the raw size >> by assuming the data is plain-encoded? >> For example, does INT16 use the same bit-width as INT32? >> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra >> sizeof(int32) for its length? >> 2. How do we take care of null values? Should we add the size of the validity >> bitmap or null buffer to the raw size? >> 3. What about complex types? >> Actually only leaf columns have data in the Parquet file. Should we >> use >> the sum of all sub columns to be the raw size of a nested column? >> 4. Where to store these raw sizes? >> Add it to the PageHeader? Or should we aggregate it in the >> ColumnChunkMetaData? >> >> Best, >> Gang >> >> On Sat, Mar 25, 2023 at 12:59 AM Will Jones >> wrote: >> >> > Hi Micah, >> > >> > We were just discussing in the Arrow repo how useful it would be to have >> > utilities that could accurately estimate the deserialized size of a >> Parquet >> > file. [1] So I would be very supportive of this. >> > >> > IIUC the implementation of this should be trivial for many
Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
> > 1. How are primitive types computed? Should we simply compute the raw size > by assuming the data is plain-encoded? > For example, does INT16 use the same bit-width as INT32? > What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra > sizeof(int32) for its length? Yes, my suggestion is the raw size assuming it is plain encoded. INT16 has the same size as INT32. In general, for fixed-width types, it is easy to back out the actual byte size in memory given the number of values stored and the number of null values. For BYTE_ARRAY this means we store 4 bytes for every non-null value. For FIXED_SIZE_BYTE_ARRAY plain encoding would have 4 bytes for every value, so my suggestion is yes, we add in the size overhead. Again, the size overhead can be backed out given the number of nulls and the number of values. Given this for > 2. How do we take care of null values? Should we add the size of the validity > bitmap or null buffer to the raw size? No, I think this can be inferred from metadata and consumers can calculate the space they think this would take in their memory representation. Open to thoughts here, but it seems standardizing on plain encoding is the easiest for people to understand and adapt the estimate to what they actually care about, but I can see the other side of simplifying the computation for systems consuming parquet, so happy to go either way. > 3. What about complex types? > Actually only leaf columns have data in the Parquet file. Should we use > the sum of all sub columns to be the raw size of a nested column? My preference would be to keep this on leaf columns. This leads to two complications: 1. Accurately estimating the cost of group values (structs). This should be reverse-engineerable if the number of records in the page/column chunk is known (i.e. size estimates do not account for group values, and the reader would calculate based on the number of rows). This might be a good reason to try to get data page v2 into shape or back-port the number of started records into data page v1. 2. For repeated values, I think it is sufficient to get a reasonable estimate to know the number of array starts (this includes nested arrays) contained in a page/column chunk, and we can add a new field to record this separately. Thoughts? > 4. Where to store these raw sizes? > Add it to the PageHeader? Or should we aggregate it in the > ColumnChunkMetaData? I would suggest adding it to both (IIUC we store uncompressed size and other values in both as well). Thanks, Micah On Sat, Mar 25, 2023 at 7:23 AM Gang Wu wrote: > +1 for adding the raw size of each column into the Parquet specs. > > I used to work around this by adding similar but hidden fields to the file > formats. > > Let me bring some detailed questions to the table. > 1. How are primitive types computed? Should we simply compute the raw size > by assuming the data is plain-encoded? > For example, does INT16 use the same bit-width as INT32? > What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra > sizeof(int32) for its length? > 2. How do we take care of null values? Should we add the size of the validity > bitmap or null buffer to the raw size? > 3. What about complex types? > Actually only leaf columns have data in the Parquet file. Should we use > the sum of all sub columns to be the raw size of a nested column? > 4. Where to store these raw sizes? > Add it to the PageHeader? Or should we aggregate it in the > ColumnChunkMetaData? > > Best, > Gang > > On Sat, Mar 25, 2023 at 12:59 AM Will Jones > wrote: > > > Hi Micah, > > > > We were just discussing in the Arrow repo how useful it would be to have > > utilities that could accurately estimate the deserialized size of a > Parquet > > file. [1] So I would be very supportive of this. > > > > IIUC the implementation of this should be trivial for many fixed-size > > types, although there may be cases that are more complex to track. I'd > > definitely be interested to hear from folks who have worked on the > > implementations for the other size fields what the level of difficulty is > > to implement such a field. > > > > Best, > > > > Will Jones > > > > [1] https://github.com/apache/arrow/issues/34712 > > > > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield > > wrote: > > > > > Parquet metadata currently tracks uncompressed and compressed > page/column > > > sizes [1][2]. Uncompressed size here corresponds to encoded size which > > can > > > differ substantially from the plain encoding size due to RLE/Dictionary > > > encoding. > > > > > > When doing query planning/execution it can be useful to understand the > > > total raw size in bytes (e.g. whether to do a broadcast join). > > > > > > Would people be open to adding an optional field that records the > > estimated > > > (or exact) size of the column if plain encoding had been used? > > > > > > Thanks, > > > Micah > > > > > > [1] > > > > > > > > >
Re: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
> > 1. Null variables. In Arrow Array, null-value should occupy some place, but > field-raw size cannot represent that value. This is a good point. The number of nulls can be inferred from statistics or is included in data page v2 [1]. I'd rather not bake in assumptions about the size of nulls, as different systems can represent them differently, and I would prefer to keep this memory-representation agnostic. I'm open to thoughts here. > 2. Size of FLBA/ByteArray. It's size should be variable-size-summary or > variable-size-summary + sizeof(ByteArray) * value-count My suggestion here is the size of the plain-encoded values, because the encoding is already well defined in Parquet. I think for FLBA this ends up being equal to ["size of column" * number of non-null values]. For BYTE_ARRAY, the formula listed here is what plain encoding would work out to, where sizeof(ByteArray) = 4. I think this is preferable, but let me know if this doesn't cover your use-case. > 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored > as int32 or int64. > Hope that helps. Yes, my intent would be to keep this agnostic from other systems, but I think the information allows other systems to use the estimate reasonably well or back out their own computation. The size of the Decimal values in Arrow can be determined by the precision and scale of the column, the chosen Arrow decimal width, and the number of values. Best, Micah [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L563 On Sat, Mar 25, 2023 at 9:36 AM wish maple wrote: > +1 for an uncompressed size for the field. However, it's a bit tricky here. > I've > implemented a similar size hint in our system; here are some problems I met: > 1. Null values. In an Arrow Array, a null value should occupy some place, but > the field's raw size cannot represent that value. > 2. Size of FLBA/ByteArray. Its size should be variable-size-summary or > variable-size-summary + sizeof(ByteArray) * value-count > 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored > as int32 or int64. > Hope that helps. > > Best, Xuwei Fu > > On 2023/03/24 16:59:31 Will Jones wrote: > > Hi Micah, > > > > We were just discussing in the Arrow repo how useful it would be to have > > utilities that could accurately estimate the deserialized size of a > Parquet > > file. [1] So I would be very supportive of this. > > > > IIUC the implementation of this should be trivial for many fixed-size > > types, although there may be cases that are more complex to track. I'd > > definitely be interested to hear from folks who have worked on the > > implementations for the other size fields what the level of difficulty is > > to implement such a field. > > > > Best, > > > > Will Jones > > > > [1] https://github.com/apache/arrow/issues/34712 > > > > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield > > wrote: > > > > > Parquet metadata currently tracks uncompressed and compressed > page/column > > > sizes [1][2]. Uncompressed size here corresponds to encoded size which > can > > > differ substantially from the plain encoding size due to RLE/Dictionary > > > encoding. > > > > > > When doing query planning/execution it can be useful to understand the > > > total raw size in bytes (e.g. whether to do a broadcast join). > > > > > > Would people be open to adding an optional field that records the > estimated > > > (or exact) size of the column if plain encoding had been used? > > > > > > Thanks, > > > Micah > > > > > > [1] > > > > > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728 > > > [2] > > > > > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637 > > > > > >
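The convention discussed above (plain-encoded sizes: a 4-byte length per non-null BYTE_ARRAY value, fixed width times non-null count for FLBA) can be sketched as a hypothetical helper — names and signature are illustrative, not from the PR:

```python
# Byte widths of PLAIN-encoded fixed-width physical types in Parquet.
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8, "INT96": 12}

def plain_encoded_size(ptype, num_non_null, data_bytes=0, type_length=0):
    """Bytes the values of a column would occupy under PLAIN encoding.

    data_bytes: total raw bytes of variable-length data (BYTE_ARRAY only).
    type_length: declared width of a FIXED_LEN_BYTE_ARRAY column.
    """
    if ptype == "BYTE_ARRAY":
        # PLAIN stores a 4-byte little-endian length before each non-null value.
        return data_bytes + 4 * num_non_null
    if ptype == "FIXED_LEN_BYTE_ARRAY":
        # No length prefix: "size of column" * number of non-null values.
        return type_length * num_non_null
    # Fixed-width primitives; note INT16 is physically an INT32 in Parquet.
    return PLAIN_WIDTH[ptype] * num_non_null

print(plain_encoded_size("BYTE_ARRAY", 3, data_bytes=10))             # 22
print(plain_encoded_size("FIXED_LEN_BYTE_ARRAY", 5, type_length=16))  # 80
print(plain_encoded_size("INT32", 7))                                 # 28
```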
RE: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
+1 for an uncompressed size for the field. However, it's a bit tricky here. I've implemented a similar size hint in our system; here are some problems I met: 1. Null values. In an Arrow Array, a null value should occupy some space, but the field's raw size cannot represent that value. 2. Size of FLBA/ByteArray. Its size should be variable-size-summary or variable-size-summary + sizeof(ByteArray) * value-count. 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored as int32 or int64. Hope that helps. Best, Xuwei Fu On 2023/03/24 16:59:31 Will Jones wrote: > Hi Micah, > > We were just discussing in the Arrow repo how useful it would be to have > utilities that could accurately estimate the deserialized size of a Parquet > file. [1] So I would be very supportive of this. > > IIUC the implementation of this should be trivial for many fixed-size > types, although there may be cases that are more complex to track. I'd > definitely be interested to hear from folks who have worked on the > implementations for the other size fields what the level of difficulty is > to implement such a field. > > Best, > > Will Jones > > [1] https://github.com/apache/arrow/issues/34712 > > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield > wrote: > > > Parquet metadata currently tracks uncompressed and compressed page/column > > sizes [1][2]. Uncompressed size here corresponds to encoded size which can > > differ substantially from the plain encoding size due to RLE/Dictionary > > encoding. > > > > When doing query planning/execution it can be useful to understand the > > total raw size in bytes (e.g. whether to do a broadcast join). > > > > Would people be open to adding an optional field that records the estimated > > (or exact) size of the column if plain encoding had been used?
> > > > Thanks, > > Micah > > > > [1] > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728 > > [2] > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637 > > >
Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
+1 for adding the raw size of each column into the Parquet specs. I used to work around this by adding similar but hidden fields to the file formats. Let me bring some detailed questions to the table. 1. How are primitive types computed? Should we simply compute the raw size by assuming the data is plain-encoded? For example, does INT16 use the same bit-width as INT32? What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra sizeof(int32) for its length? 2. How do we take care of null values? Should we add the size of the validity bitmap or null buffer to the raw size? 3. What about complex types? Actually only leaf columns have data in the Parquet file. Should we use the sum of all sub columns to be the raw size of a nested column? 4. Where to store these raw sizes? Add it to the PageHeader? Or should we aggregate it in the ColumnChunkMetaData? Best, Gang On Sat, Mar 25, 2023 at 12:59 AM Will Jones wrote: > Hi Micah, > > We were just discussing in the Arrow repo how useful it would be to have > utilities that could accurately estimate the deserialized size of a Parquet > file. [1] So I would be very supportive of this. > > IIUC the implementation of this should be trivial for many fixed-size > types, although there may be cases that are more complex to track. I'd > definitely be interested to hear from folks who have worked on the > implementations for the other size fields what the level of difficulty is > to implement such a field. > > Best, > > Will Jones > > [1] https://github.com/apache/arrow/issues/34712 > > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield > wrote: > > > Parquet metadata currently tracks uncompressed and compressed page/column > > sizes [1][2]. Uncompressed size here corresponds to encoded size which > can > > differ substantially from the plain encoding size due to RLE/Dictionary > > encoding. > > > > When doing query planning/execution it can be useful to understand the > > total raw size in bytes (e.g.
whether to do a broadcast join). > > > > Would people be open to adding an optional field that records the > estimated > > (or exact) size of the column if plain encoding had been used? > > > > Thanks, > > Micah > > > > [1] > > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728 > > [2] > > > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637 > > >
Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
Hi Micah, We were just discussing in the Arrow repo how useful it would be to have utilities that could accurately estimate the deserialized size of a Parquet file. [1] So I would be very supportive of this. IIUC the implementation of this should be trivial for many fixed-size types, although there may be cases that are more complex to track. I'd definitely be interested to hear from folks who have worked on the implementations for the other size fields what the level of difficulty is to implement such a field. Best, Will Jones [1] https://github.com/apache/arrow/issues/34712 On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield wrote: > Parquet metadata currently tracks uncompressed and compressed page/column > sizes [1][2]. Uncompressed size here corresponds to encoded size which can > differ substantially from the plain encoding size due to RLE/Dictionary > encoding. > > When doing query planning/execution it can be useful to understand the > total raw size in bytes (e.g. whether to do a broadcast join). > > Would people be open to adding an optional field that records the estimated > (or exact) size of the column if plain encoding had been used? > > Thanks, > Micah > > [1] > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728 > [2] > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637 >
[DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata
Parquet metadata currently tracks uncompressed and compressed page/column sizes [1][2]. Uncompressed size here corresponds to encoded size, which can differ substantially from the plain encoding size due to RLE/Dictionary encoding. When doing query planning/execution it can be useful to understand the total raw size in bytes (e.g. whether to do a broadcast join). Would people be open to adding an optional field that records the estimated (or exact) size of the column if plain encoding had been used? Thanks, Micah [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728 [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
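The gap described above can be illustrated with a toy back-of-the-envelope calculation (not Parquet's actual encoder — dictionary indices here are assumed to cost 1 bit each for a 2-entry dictionary, ignoring page headers and RLE framing):

```python
# A highly repetitive BYTE_ARRAY column: 1,000,000 values, only 2 distinct.
values = [b"us-east-1", b"us-west-2"] * 500_000

# Plain encoding: a 4-byte length prefix plus the data, for every value.
plain_size = sum(4 + len(v) for v in values)

# Dictionary encoding, roughly: each distinct value stored once, plus a
# small integer index per value (1 bit each for a 2-entry dictionary).
dict_size = sum(4 + len(v) for v in set(values)) + len(values) // 8

print(plain_size)  # 13000000
print(dict_size)   # 125026
```

Here the encoded ("uncompressed") size is about 100x smaller than the plain-encoded size, which is why the existing uncompressed-size field is a poor proxy for in-memory size.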