I don't disagree that flatbuffers would be faster than thrift decoding.

I am trying to say that with software engineering only (no change to the
format) it is likely possible to increase parquet thrift metadata parsing
speed by 4x.

This is not 25x of course, but 4x is nontrivial.

The fact that no one has yet bothered to invest the time to get the 4x
in open source implementations of parquet suggests to me that the parsing
time may not be as critical an issue as we think.

Andrew

On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos
<[email protected]> wrote:

> The difference in parsing speed between thrift and flatbuffer is >25x.
> Thrift has some fundamental design decisions that make decoding slow:
> 1. the thrift compact protocol is very data dependent: uleb encoding for
> integers, field ids are deltas from previous. The data dependencies
> defeat pipelining on modern cpus
> 2. object model does not have a way to use arenas to avoid many allocations
> of objects
> If we keep thrift, we can potentially get 2 fixed, but fixing 1 requires
> changes to the thrift serialization protocol. Such a change is not
> different from switching serialization format.
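
To make point 1 above concrete, here is a minimal Python sketch of the two
data dependencies it names: ULEB128 varints (the length of each integer is
only known after reading its bytes) and field ids stored as deltas from the
previous field. The function names are hypothetical and this is not code from
any Parquet or Thrift implementation. It assumes a flat struct that contains
only i32 fields (compact-protocol type code 5).

```python
def read_uleb128(buf: bytes, pos: int):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        # The continuation bit decides whether another byte belongs to this
        # integer, so the start of the next field is unknown until here.
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Map a zigzag-encoded unsigned value back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


def decode_i32_struct(buf: bytes) -> dict:
    """Decode a flat struct of i32 fields (assumption: no other types)."""
    fields = {}
    pos = 0
    last_field_id = 0
    while True:
        header = buf[pos]
        pos += 1
        if header == 0:  # stop field ends the struct
            return fields
        delta = header >> 4
        field_type = header & 0x0F
        if delta == 0:
            # Long form: the field id follows as a zigzag varint.
            raw, pos = read_uleb128(buf, pos)
            field_id = zigzag_decode(raw)
        else:
            # Short form: the field id depends on the previous field id.
            field_id = last_field_id + delta
        if field_type != 5:
            raise ValueError("this sketch only handles i32 fields")
        raw, pos = read_uleb128(buf, pos)
        fields[field_id] = zigzag_decode(raw)
        last_field_id = field_id


# Field 1 = 150 and field 3 = -1, followed by the stop byte.
print(decode_i32_struct(bytes([0x15, 0xAC, 0x02, 0x25, 0x01, 0x00])))
# {1: 150, 3: -1}
```

Each loop iteration needs both `pos` and `last_field_id` from the previous
one, which is the serial dependency chain described in point 1.
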
>
>
> On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <[email protected]>
> wrote:
>
> > I wanted to share some work Xiangpeng Hao did at InfluxData this summer on
> > the current (thrift) metadata format[1].
> >
> > We found that with careful software engineering, we could likely improve
> > the speed of reading existing parquet footer format by a factor of 4 or
> > more ([2] contains some specific ideas). While we analyzed the
> > Rust implementation, I believe a similar conclusion applies to C/C++.
> >
> > I realize that there are certain features that switching to an entirely new
> > footer format would achieve, but the cost of adopting a new format
> > across the ecosystem is immense (e.g. Parquet "version 2.0" etc).
> >
> > It is my opinion that investing the same effort in software optimization
> > that would be required for a new footer format would have a much bigger
> > impact.
> >
> > Andrew
> >
> > [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > [2]: https://github.com/apache/arrow-rs/issues/5853
> >
> > On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos
> > <[email protected]> wrote:
> >
> > > Hi Julien.
> > >
> > > Thank you for reconnecting the threads.
> > >
> > > I have broken down my experiments in a narrative, commit by commit, on how
> > > we can go from flatbuffers being ~2x larger than thrift to being smaller
> > > than thrift (at times even half the size). This is still on an internal
> > > branch; I will resume work towards the end of this month to port it to
> > > arrow so that folks can look at the progress and share ideas.
> > >
> > > On the benchmarking front I need to build and share a binary for third
> > > parties to donate their footers for analysis.
> > >
> > > The PR for parquet extensions has gotten a few rounds of reviews:
> > > https://github.com/apache/parquet-format/pull/254. I hope it will be
> > > merged soon.
> > >
> > > I missed the sync yesterday - for some reason I didn't receive an
> > > invitation. Julien could you add me again to the invite list?
> > >
> > > On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <[email protected]> wrote:
> > >
> > > > This came up in the sync today.
> > > >
> > > > There are a few concurrent experiments with flatbuffers for a future
> > > > Parquet footer replacement. In itself it is fine and just wanted to
> > > > reconnect the threads here so that folks are aware of each other and
> > > > can share findings.
> > > >
> > > > - Neelaksh benchmarking and experiments:
> > > >
> > > > https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> > > > https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> > > >
> > > > - Alkis has also been experimenting and led the proposal for enabling
> > > > extending the existing footer.
> > > >
> > > > https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> > > >
> > > > - Xuwei also shared that he is looking into this.
> > > >
> > > > I would suggest that you all reply to this thread sharing your current
> > > > progress or ideas and a link to your respective repos for
> > > > experimenting.
> > > >
> > > > Best
> > > > Julien
> > > >
> > >
> >
>
