Has anyone spent time optimizing the thrift decoder (e.g., not just using whatever a general-purpose thrift compiler generates, but custom-coding a parser just for Parquet metadata)?
Ed is in the process of implementing just such a decoder in arrow-rs [1] and has seen a 2-3x performance improvement (with no change to the format) in early benchmark results. This is in line with our earlier work on the topic [2], where we estimated there is a 2-4x performance improvement available from implementation improvements alone.

Andrew

[1]: https://github.com/apache/arrow-rs/issues/5854
[2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/

On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi again,
>
> OK, a quick summary of my current feedback on this:
>
> - Decoding speed measurements are given, but not footer size
>   measurements; it would be interesting to have both.
>
> - It's not obvious whether the stated numbers are for reading all
>   columns or a subset of them.
>
> - Optional LZ4 compression is mentioned, but no numbers are given for
>   it; it would be nice if numbers were available for both uncompressed
>   and compressed footers.
>
> - The numbers seem quite underwhelming currently; I think most of us
>   were expecting massive speed improvements given past discussions.
>
> - I'm firmly against narrowing sizes to 32 bits; making the footer more
>   compact is useful, but not to the point of reducing usefulness or
>   generality.
>
> A more general proposal: given the slightly underwhelming perf
> numbers, has nested Flatbuffers been considered as an alternative?
>
> For example, the RowGroup table could become:
>
> ```
> table ColumnChunk {
>   file_path: string;
>   meta_data: ColumnMetadata;
>   // etc.
> }
>
> table EncodedColumnChunk {
>   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
>   column: [ubyte];
> }
>
> table RowGroup {
>   columns: [EncodedColumnChunk];
>   total_byte_size: int;
>   num_rows: int;
>   sorting_columns: [SortingColumn];
>   file_offset: long;
>   total_compressed_size: int;
>   ordinal: short = null;
> }
> ```
>
> Regards
>
> Antoine.
>
> On Thu, 11 Sep 2025 08:41:34 +0200
> Alkis Evlogimenos <alkis.evlogime...@databricks.com.INVALID> wrote:
> > Hi all. I am sharing as a separate thread the proposal for the footer
> > change we have been working on:
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> >
> > The proposal outlines the technical aspects of the design and the
> > experimental results of shadow testing this in production workloads. I
> > would like to discuss the proposal's most salient points in the next sync:
> > 1. the use of flatbuffers as footer serialization format
> > 2. the additional limitations imposed on parquet files (row group size
> >    limit, row group max num row limit)
> >
> > I would prefer comments on the Google doc to facilitate async discussion.
> >
> > Thank you,
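[Editor's sketch] For readers wondering what a specialized (non-generated) decoder for Parquet metadata involves: Parquet footers use Thrift's compact protocol, whose core primitives are ULEB128 varints and zigzag-encoded signed integers. The sketch below shows those two primitives in Python; it is illustrative only and is not the arrow-rs implementation referenced in [1].

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a ULEB128 varint from buf starting at pos.

    Returns (value, new_pos). Each byte contributes its low 7 bits;
    a clear high bit marks the final byte.
    """
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Map a zigzag-encoded unsigned int back to a signed int (0,1,2,3 -> 0,-1,1,-2)."""
    return (n >> 1) ^ -(n & 1)


# Example: 300 is encoded as the two bytes 0xAC 0x02.
value, end = read_uvarint(b"\xac\x02", 0)
# A hand-rolled footer parser walks compact-protocol field headers with
# primitives like these, skipping unneeded fields rather than
# materializing every generated struct -- which is where the speedup
# Andrew describes comes from.
```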
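[Editor's sketch] The gain Antoine's nested-Flatbuffers proposal is after, paying deserialization cost only for the columns a query projects, can be illustrated with a toy stand-in that uses JSON in place of Flatbuffers. All names here are hypothetical, chosen for illustration.

```python
import json

def encode_row_group(column_metas: list[dict]) -> list[bytes]:
    """Serialize each column chunk's metadata independently, mirroring
    the [ubyte] per-column buffers in the proposed EncodedColumnChunk."""
    return [json.dumps(m).encode() for m in column_metas]


def decode_columns(encoded: list[bytes], wanted: list[int]) -> dict[int, dict]:
    """Deserialize only the requested column chunks; the rest stay as
    opaque bytes and cost nothing to skip."""
    return {i: json.loads(encoded[i]) for i in wanted}


# A wide table with 1000 columns, of which a query touches only two.
metas = [{"name": f"c{i}", "num_values": 100} for i in range(1000)]
enc = encode_row_group(metas)
sel = decode_columns(enc, [0, 999])
```

The same shape applies to the real proposal: with one Flatbuffer nested per column, validation and decoding become per-chunk rather than all-or-nothing for the whole RowGroup.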