+1 for an uncompressed-size field. However, it's a bit tricky. I've implemented a similar size hint in our system; here are some problems I ran into:

1. Null values. In an Arrow Array, a null value still occupies space, but a raw-size field for the column cannot account for that.
2. Size of FLBA/ByteArray. Its size could be either the sum of the variable-length payloads, or that sum plus sizeof(ByteArray) * value-count.
3. Sometimes the Arrow data is not equal to the Parquet data, e.g. Decimal stored as int32 or int64.

Hope that helps.
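To make point 2 concrete, here is a minimal sketch of the two candidate definitions of "raw size" for a ByteArray column. The struct and function names are illustrative, not the actual Parquet C++ API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-in for parquet::ByteArray: a length plus a pointer
// into the decoded buffer.
struct ByteArrayRef {
  uint32_t len;
  const uint8_t* ptr;
};

// Option A: count only the variable-length payload bytes
// (the "variable-size-summary").
size_t PayloadOnlySize(const std::vector<ByteArrayRef>& values) {
  size_t total = 0;
  for (const auto& v : values) total += v.len;
  return total;
}

// Option B: payload bytes plus the fixed per-value struct overhead,
// i.e. variable-size-summary + sizeof(ByteArray) * value-count.
size_t PayloadPlusHeaderSize(const std::vector<ByteArrayRef>& values) {
  return PayloadOnlySize(values) + sizeof(ByteArrayRef) * values.size();
}
```

Whichever definition the spec picks, readers and writers have to agree on it, otherwise the hint is not comparable across implementations.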
Best,
Xuwei Fu

On 2023/03/24 16:26:51 Micah Kornfield wrote:
> Parquet metadata currently tracks uncompressed and compressed page/column
> sizes [1][2]. Uncompressed size here corresponds to encoded size, which can
> differ substantially from the plain-encoding size due to RLE/Dictionary
> encoding.
>
> When doing query planning/execution it can be useful to understand the
> total raw size of bytes (e.g. whether to do a broadcast join).
>
> Would people be open to adding an optional field that records the estimated
> (or exact) size of the column if plain encoding had been used?
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> [2]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637