Hi Andrew,

I haven't heard of anything like this for C++, but it is an intriguing
idea.

Regards

Antoine.


On Tue, 16 Sep 2025 16:44:14 -0400
Andrew Lamb <[email protected]>
wrote:
> Has anyone spent time optimizing the thrift decoder (e.g. not just use
> whatever a general purpose thrift compiler generates, but custom code a
> parser just for Parquet metadata)?
> 
> Ed is in the process of implementing just such a decoder in arrow-rs[1] and
> has seen a 2-3x performance improvement (with no change to the format) in
> early benchmark results. This is in line with our earlier work on the
> topic[2] where we estimated there is a 2-4x performance improvement with
> implementation improvements alone.
> 
> Andrew
> 
> [1]: https://github.com/apache/arrow-rs/issues/5854
> [2]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> 
> On Tue, Sep 16, 2025 at 4:12 AM Antoine Pitrou 
> <[email protected]> wrote:
> 
> >
> > Hi again,
> >
> > Ok, a quick summary of my current feedback on this:
> >
> > - decoding speed measurements are given, but not footer size
> >   measurements; it would be interesting to have both
> >
> > - it's not obvious whether the stated numbers are for reading all
> >   columns or a subset of them
> >
> > - optional LZ4 compression is mentioned, but no numbers are given for
> >   it; it would be nice if numbers were available for both uncompressed
> >   and compressed footers
> >
> > - the numbers seem quite underwhelming currently, I think most of us
> >   were expecting massive speed improvements given past discussions
> >
> > - I'm firmly against narrowing sizes to 32 bits; making the footer more
> >   compact is useful, but not to the point of reducing usefulness or
> >   generality
> >
> >
> > A more general proposal: given the slightly underwhelming perf
> > numbers, has nested Flatbuffers been considered as an alternative?
> >
> > For example, the RowGroup table could become:
> > ```
> > table ColumnChunk {
> >   file_path: string;
> >   meta_data: ColumnMetadata;
> >   // etc.
> > }
> >
> > table EncodedColumnChunk {
> >   // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
> >   column: [ubyte];
> > }
> >
> > table RowGroup {
> >   columns: [EncodedColumnChunk];
> >   total_byte_size: int;
> >   num_rows: int;
> >   sorting_columns: [SortingColumn];
> >   file_offset: long;
> >   total_compressed_size: int;
> >   ordinal: short = null;
> > }
> > ```
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On Thu, 11 Sep 2025 08:41:34 +0200
> > Alkis Evlogimenos
> > <[email protected]>
> > wrote:  
> > > Hi all. I am sharing as a separate thread the proposal for the footer
> > > change we have been working on:
> > > https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> > >
> > > The proposal outlines the technical aspects of the design and the
> > > experimental results of shadow testing this in production workloads. I
> > > would like to discuss the proposal's most salient points in the next sync:
> > > 1. the use of flatbuffers as footer serialization format
> > > 2. the additional limitations imposed on parquet files (row group size
> > > limit, row group max num row limit)
> > >
> > > I would prefer comments on the google doc to facilitate async discussion.
> > >
> > > Thank you,
> > >  
> 
