Has anyone spent time optimizing the Thrift decoder (e.g. not just using
whatever a general-purpose Thrift compiler generates, but custom-coding a
parser just for Parquet metadata)?

Ed is in the process of implementing just such a decoder in arrow-rs[1] and
has seen a 2-3x performance improvement (with no change to the format) in
early benchmark results. This is in line with our earlier work on the
topic[2], where we estimated a 2-4x performance improvement is achievable
from implementation improvements alone.

Andrew

[1]: https://github.com/apache/arrow-rs/issues/5854
[2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/

On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi again,
>
> Ok, a quick summary of my current feedback on this:
>
> - decoding speed measurements are given, but not footer size
>   measurements; it would be interesting to have both
>
> - it's not obvious whether the stated numbers are for reading all
>   columns or a subset of them
>
> - optional LZ4 compression is mentioned, but no numbers are given for
>   it; it would be nice if numbers were available for both uncompressed
>   and compressed footers
>
> - the numbers currently seem quite underwhelming; I think most of us
>   were expecting massive speed improvements given past discussions
>
> - I'm firmly against narrowing sizes to 32 bits; making the footer more
>   compact is useful, but not to the point of reducing usefulness or
>   generality
>
>
> A more general proposal: given the slightly underwhelming perf
> numbers, has nested Flatbuffers been considered as an alternative?
>
> For example, the RowGroup table could become:
> ```
> table ColumnChunk {
>   file_path: string;
>   meta_data: ColumnMetadata;
>   // etc.
> }
>
> table EncodedColumnChunk {
>   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
>   column: [ubyte] (nested_flatbuffer: "ColumnChunk");
> }
>
> table RowGroup {
>   columns: [EncodedColumnChunk];
>   total_byte_size: long;
>   num_rows: long;
>   sorting_columns: [SortingColumn];
>   file_offset: long;
>   total_compressed_size: long;
>   ordinal: short = null;
> }
> ```
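>
> A reader could then decode column chunks lazily. A minimal sketch,
> assuming flatc-generated Rust bindings for the schema above (the
> `RowGroup`, `EncodedColumnChunk` and `ColumnChunk` types and their
> accessors would be generated; the names here are illustrative):
> ```rust
> use flatbuffers::root;
>
> // Extract file_path for a projected subset of columns. Verifying the
> // outer RowGroup does not recurse into the nested buffers; each
> // column chunk is validated only if and when it is accessed.
> fn file_paths(row_group_bytes: &[u8], wanted: &[usize]) -> Vec<String> {
>     let rg = root::<RowGroup>(row_group_bytes).expect("invalid row group");
>     let cols = rg.columns().expect("missing columns");
>     wanted
>         .iter()
>         .map(|&i| {
>             let nested = cols.get(i).column().expect("missing chunk bytes");
>             // Decode just this chunk; untouched chunks are never parsed.
>             let chunk = root::<ColumnChunk>(nested.bytes())
>                 .expect("invalid column chunk");
>             chunk.file_path().unwrap_or_default().to_string()
>         })
>         .collect()
> }
> ```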
>
> Regards
>
> Antoine.
>
>
>
> On Thu, 11 Sep 2025 08:41:34 +0200 Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.INVALID> wrote:
> > Hi all. I am sharing as a separate thread the proposal for the footer
> > change we have been working on:
> >
> > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> >
> > The proposal outlines the technical aspects of the design and the
> > experimental results of shadow testing this in production workloads. I
> > would like to discuss the proposal's most salient points in the next
> > sync:
> > 1. the use of Flatbuffers as the footer serialization format
> > 2. the additional limitations imposed on Parquet files (row group size
> > limit, row group max row count limit)
> >
> > I would prefer comments on the Google doc to facilitate async discussion.
> >
> > Thank you,
> >
