One more point I would like to mention: we have put a lot of effort into
REPRODUCIBILITY for this benchmark repo. There have been many great
benchmarking efforts as part of this discussion. However, one limitation is
that many of the experiments have not included code or take a fair bit of
effort to set up. We have used Docker and vcpkg to make the setup for these
benchmarks as transparent and reproducible as possible. Our hope is that
this lets others either reproduce many of the results that have been
discussed or easily run their own experiments when trying alternatives, and
that easily shareable experiments will help facilitate the discussion.

On Thu, Aug 15, 2024, 9:21 PM Alkis Evlogimenos
<[email protected]> wrote:

> > Alkis, can you elaborate how you brought the size of Flatbuffers down?
>
> I have the internal PR rewritten in separate commits with all the steps. I
> plan to publish it to arrow repo as soon as possible. The heavy things in
> metadata are statistics, offsets, path_in_schema. It takes ~10 steps to cut
> the size down, each of which takes a good chunk of the original size.
>
> On Thu, Aug 15, 2024 at 2:43 PM Jan Finis <[email protected]> wrote:
>
> > I guess most closed-source implementations have done these optimizations
> > already; it has just not been done in the open source versions. E.g., we
> > switched to a custom-built thrift runtime using pool allocators and string
> > views instead of copied strings a few years ago, seeing comparable
> > speed-ups. The C++ thrift library is just horribly inefficient.
> >
> > I agree with Alkis though that there are some gains that can be achieved
> > by optimizing, but the format has inherent drawbacks. Flatbuffers is indeed
> > more efficient but at the cost of increased size.
> > Alkis, can you elaborate how you brought the size of Flatbuffers down?
> >
> > Cheers,
> > Jan
> >
> > On Thu, Aug 15, 2024 at 1:50 PM Andrew Lamb <[email protected]> wrote:
> >
> > > I don't disagree that flatbuffers would be faster than thrift decoding.
> > >
> > > I am trying to say that with software engineering only (no change to the
> > > format) it is likely possible to increase parquet thrift metadata parsing
> > > speed by 4x.
> > >
> > > This is not 25x of course, but 4x is non-trivial.
> > >
> > > The fact that no one has yet bothered to invest the time to get the 4x
> > > in open source implementations of parquet suggests to me that the parsing
> > > time may not be as critical an issue as we think.
> > >
> > > Andrew
> > >
> > > On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos
> > > <[email protected]> wrote:
> > >
> > > > The difference in parsing speed between thrift and flatbuffers is >25x.
> > > > Thrift has some fundamental design decisions that make decoding slow:
> > > > 1. the thrift compact protocol is very data dependent: uleb encoding for
> > > > integers, field ids are deltas from the previous one. The data
> > > > dependencies prevent modern CPUs from pipelining the decode
> > > > 2. the object model does not have a way to use arenas to avoid many
> > > > allocations of objects
> > > > If we keep thrift, we can potentially get 2 fixed, but fixing 1 requires
> > > > changes to the thrift serialization protocol. Such a change is not
> > > > different from switching serialization format.
> > > >
> > > >
> > > > On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <[email protected]> wrote:
> > > >
> > > > > I wanted to share some work Xiangpeng Hao did at InfluxData this summer
> > > > > on the current (thrift) metadata format[1].
> > > > >
> > > > > We found that with careful software engineering, we could likely improve
> > > > > the speed of reading the existing parquet footer format by a factor of 4
> > > > > or more ([2] contains some specific ideas). While we analyzed the
> > > > > Rust implementation, I believe a similar conclusion applies to C/C++.
> > > > >
> > > > > I realize that there are certain features that switching to an entirely
> > > > > new footer format would achieve, but the cost of adopting a new format
> > > > > across the ecosystem is immense (e.g. Parquet "version 2.0" etc).
> > > > >
> > > > > It is my opinion that investing the same effort in software optimization
> > > > > that would be required for a new footer format would have a much bigger
> > > > > impact.
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > > > > [2]: https://github.com/apache/arrow-rs/issues/5853
> > > > >
> > > > > On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Julien.
> > > > > >
> > > > > > Thank you for reconnecting the threads.
> > > > > >
> > > > > > I have broken down my experiments in a narrative, commit by commit, on
> > > > > > how we can go from flatbuffers being ~2x larger than thrift to being
> > > > > > smaller than thrift (at times even half the size). This is still on an
> > > > > > internal branch; I will resume work towards the end of this month to
> > > > > > port it to arrow so that folks can look at the progress and share ideas.
> > > > > >
> > > > > > On the benchmarking front I need to build and share a binary for third
> > > > > > parties to donate their footers for analysis.
> > > > > >
> > > > > > The PR for parquet extensions has gotten a few rounds of reviews:
> > > > > > https://github.com/apache/parquet-format/pull/254. I hope it will be
> > > > > > merged soon.
> > > > > >
> > > > > > I missed the sync yesterday - for some reason I didn't receive an
> > > > > > invitation. Julien could you add me again to the invite list?
> > > > > >
> > > > > > On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <[email protected]> wrote:
> > > > > >
> > > > > > > This came up in the sync today.
> > > > > > >
> > > > > > > There are a few concurrent experiments with flatbuffers for a future
> > > > > > > Parquet footer replacement. In itself that is fine; I just wanted to
> > > > > > > reconnect the threads here so that folks are aware of each other and
> > > > > > > can share findings.
> > > > > > >
> > > > > > > - Neelaksh benchmarking and experiments:
> > > > > > > https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> > > > > > > https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> > > > > > >
> > > > > > > - Alkis has also been experimenting and led the proposal for enabling
> > > > > > > extending the existing footer.
> > > > > > > https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> > > > > > >
> > > > > > > - Xuwei also shared that he is looking into this.
> > > > > > >
> > > > > > > I would suggest that you all reply to this thread sharing your current
> > > > > > > progress or ideas and a link to your respective repos for experimenting.
> > > > > > >
> > > > > > > Best
> > > > > > > Julien
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
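
As a rough illustration of the data dependency described in the thread for
uleb-encoded integers, here is a minimal Python sketch. It is not taken from
any Thrift or Parquet implementation, all function names are hypothetical,
and it leaves out other parts of the Thrift compact protocol (such as zigzag
encoding of signed integers and field-id deltas). It only shows that with
ULEB128 the start of value N+1 is unknown until value N has been decoded,
whereas fixed-width integers sit at offsets known up front.

```python
import struct


def encode_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128 (7 bits per byte)."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(buf, pos=0):
    """Decode one ULEB128 value at pos; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode_varint_stream(buf, count):
    """Each iteration needs the position produced by the previous one,
    so the decodes form a serial dependency chain."""
    values = []
    pos = 0
    for _ in range(count):
        value, pos = decode_uleb128(buf, pos)
        values.append(value)
    return values


def decode_fixed_stream(buf, count):
    """Fixed-width little-endian uint32: value i is always at offset 4 * i,
    so no read depends on the result of another."""
    return list(struct.unpack_from(f"<{count}I", buf, 0))


if __name__ == "__main__":
    values = [1, 300, 70000]
    varint_bytes = b"".join(encode_uleb128(v) for v in values)
    fixed_bytes = struct.pack(f"<{len(values)}I", *values)
    print(decode_varint_stream(varint_bytes, len(values)))
    print(decode_fixed_stream(fixed_bytes, len(values)))
    print(len(varint_bytes), len(fixed_bytes))
```

The same example also hints at the size trade-off raised in the thread:
the varint stream is shorter than the fixed-width one for small values,
which is part of why a fixed-layout format tends to start out larger.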
