FYI https://github.com/apache/parquet-format/pull/197 captures the proposal and has gone through a few rounds of reviews. A quick summary: This adds the following optional metadata:

1. Variable byte count for BYTE_ARRAY types at the column chunk.
2. A histogram of repetition and definition levels for the column chunk.
3. A histogram of repetition and definition levels to the page index.
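
As a rough illustration of how a planner might combine these fields, here is a minimal Python sketch. The class and field names are hypothetical and are not taken from the PR. It assumes each histogram has one bucket per level (index 0 up to the maximum level), so the last definition-level bucket counts the non-null values. It also relies on PLAIN encoding writing a 4-byte length prefix before every BYTE_ARRAY value.

```python
from dataclasses import dataclass
from typing import List

# PLAIN encoding stores each BYTE_ARRAY value as a 4-byte length followed by the bytes.
BYTE_ARRAY_LENGTH_PREFIX = 4


@dataclass
class ChunkSizeStats:
    """Hypothetical container for the proposed optional column-chunk metadata."""
    variable_byte_count: int               # total BYTE_ARRAY payload bytes, length prefixes excluded
    definition_level_histogram: List[int]  # index = definition level, value = occurrence count
    repetition_level_histogram: List[int]  # index = repetition level, value = occurrence count


def non_null_values(stats: ChunkSizeStats) -> int:
    # Assumption: the last bucket is the maximum definition level,
    # i.e. the entries that actually carry a value.
    return stats.definition_level_histogram[-1]


def null_or_empty_entries(stats: ChunkSizeStats) -> int:
    # Every entry below the maximum definition level is a null or an empty list.
    return sum(stats.definition_level_histogram[:-1])


def estimated_plain_size(stats: ChunkSizeStats) -> int:
    # Payload bytes plus one length prefix per non-null value.
    return stats.variable_byte_count + BYTE_ARRAY_LENGTH_PREFIX * non_null_values(stats)


# Example: 100 rows of an optional, non-repeated string column; 10 nulls, 1000 payload bytes.
example = ChunkSizeStats(
    variable_byte_count=1000,
    definition_level_histogram=[10, 90],
    repetition_level_histogram=[100],
)
print(estimated_plain_size(example))   # 1000 + 4 * 90 = 1360
print(null_or_empty_entries(example))  # 10
```

An in-memory estimate (for example, for Arrow) would need its own accounting for nulls and offsets; this sketch covers only the plain-encoded Parquet size.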
We plan on merging by end of week if there isn't further feedback.

On Mon, Mar 27, 2023 at 2:23 PM wish maple <[email protected]> wrote:

> +1 for an uncompressed size for the field. However, it's a bit tricky here.
> I've implemented a similar size hint in our system; here are some problems
> I met:
> 1. Null values. In an Arrow Array, a null value should occupy some space,
> but the raw field size cannot represent that.
> 2. Size of FLBA/ByteArray. Its size should be variable-size-summary or
> variable-size-summary + sizeof(ByteArray) * value-count
> 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
> as int32 or int64.
> Hope that helps.
>
> Best, Xuwei Fu
>
> On 2023/03/24 16:26:51 Micah Kornfield wrote:
> > Parquet metadata currently tracks uncompressed and compressed page/column
> > sizes [1][2]. Uncompressed size here corresponds to encoded size, which
> > can differ substantially from the plain encoding size due to
> > RLE/Dictionary encoding.
> >
> > When doing query planning/execution it can be useful to understand the
> > total raw size of bytes (e.g. whether to do a broadcast join).
> >
> > Would people be open to adding an optional field that records the
> > estimated (or exact) size of the column if plain encoding had been used?
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > [2]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
