Re: [DISCUSS] new Parquet footer experiments

Alkis Evlogimenos Thu, 15 Aug 2024 03:50:12 -0700

The difference in parsing speed between Thrift and FlatBuffers is >25x.
Thrift has some fundamental design decisions that make decoding slow:
1. The Thrift compact protocol is very data dependent: integers use ULEB
encoding, and field ids are stored as deltas from the previous field. These
data dependencies prevent modern CPUs from pipelining the decode.
2. The object model has no way to use arenas to avoid many small object
allocations.
If we keep Thrift, we can potentially get 2 fixed, but fixing 1 requires
changes to the Thrift serialization protocol. Such a change is no
different from switching serialization formats.
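
To make point 1 concrete, here is a minimal sketch of a compact-protocol
style decoder. It is not taken from any Thrift or Parquet implementation; the
function names are made up for illustration, and it handles only i32 fields
(compact type id 5) in a single flat struct. It shows the two dependencies:
the length of each varint is only known after inspecting its bytes, and each
field id is only known after decoding the previous one.

```python
def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 varint at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        # The continuation bit decides whether another byte belongs to this
        # integer, so the start of the next value depends on this byte.
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def zigzag_decode(n):
    """Map the unsigned zigzag form back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


def read_field_header(buf, pos, last_field_id):
    """Decode a field header; return (field_id, type_id, next_pos)."""
    byte = buf[pos]
    pos += 1
    delta = byte >> 4       # high nibble: field id delta (0 means "long form")
    type_id = byte & 0x0F   # low nibble: compact type id
    if delta == 0:
        raw, pos = read_uleb128(buf, pos)
        field_id = zigzag_decode(raw)
    else:
        # The field id depends on the previously decoded field id.
        field_id = last_field_id + delta
    return field_id, type_id, pos


def decode_i32_struct(buf):
    """Decode a flat struct of i32 fields terminated by a 0x00 stop byte."""
    fields = {}
    pos = 0
    last_id = 0
    while buf[pos] != 0x00:
        field_id, type_id, pos = read_field_header(buf, pos, last_id)
        assert type_id == 5, "this sketch only handles i32 fields"
        raw, pos = read_uleb128(buf, pos)
        fields[field_id] = zigzag_decode(raw)
        last_id = field_id
    return fields


# Field 1 = 150, field 2 = 1, then the stop byte.
example = bytes([0x15, 0xAC, 0x02, 0x15, 0x02, 0x00])
print(decode_i32_struct(example))  # {1: 150, 2: 1}
```

Every step in the loop needs `pos` and `last_id` from the step before it, so
the decoder cannot start on field N+1 until field N is fully decoded. A
fixed-offset layout does not have that serial chain.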



On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <[email protected]> wrote:

> I wanted to share some work Xiangpeng Hao did at InfluxData this summer on
> the current (thrift) metadata format[1].
>
> We found that with careful software engineering, we could likely improve
> the speed of reading the existing Parquet footer format by a factor of 4 or
> more ([2] contains some specific ideas). While we analyzed the
> Rust implementation, I believe a similar conclusion applies to C/C++.
>
> I realize that switching to an entirely new footer format would enable
> certain features, but the cost of adopting a new format across the
> ecosystem is immense (e.g. Parquet "version 2.0" etc).
>
> It is my opinion that investing the same effort in software optimization
> that would be required for a new footer format would have a much bigger
> impact.
>
> Andrew
>
> [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> [2]: https://github.com/apache/arrow-rs/issues/5853
>
> On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos
> <[email protected]> wrote:
>
> > Hi Julien.
> >
> > Thank you for reconnecting the threads.
> >
> > I have broken down my experiments into a narrative, commit by commit, of
> > how we can go from flatbuffers being ~2x larger than thrift to being
> > smaller than thrift (at times even half the size). This is still on an
> > internal branch. I will resume work towards the end of this month to port
> > it to arrow so that folks can look at the progress and share ideas.
> >
> > On the benchmarking front I need to build and share a binary for third
> > parties to donate their footers for analysis.
> >
> > The PR for parquet extensions has gotten a few rounds of reviews:
> > https://github.com/apache/parquet-format/pull/254. I hope it will be
> > merged soon.
> >
> > I missed the sync yesterday - for some reason I didn't receive an
> > invitation. Julien could you add me again to the invite list?
> >
> > On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <[email protected]> wrote:
> >
> > > This came up in the sync today.
> > >
> > > There are a few concurrent experiments with flatbuffers for a future
> > > Parquet footer replacement. That is fine in itself; I just wanted to
> > > reconnect the threads here so that folks are aware of each other and
> > > can share findings.
> > >
> > > - Neelaksh benchmarking and experiments:
> > >
> > > https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> > > https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> > >
> > > - Alkis has also been experimenting and led the proposal for enabling
> > > extending the existing footer.
> > >
> > >
> > > https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> > >
> > > - Xuwei also shared that he is looking into this.
> > >
> > > I would suggest that you all reply to this thread sharing your current
> > > progress or ideas and a link to your respective repos for
> > > experimenting.
> > >
> > > Best
> > > Julien
> > >
> >
>
