> Alkis, can you elaborate how you brought the size of Flatbuffers down?

I have the internal PR rewritten in separate commits with all the steps. I
plan to publish it to the arrow repo as soon as possible. The heavy parts of
the metadata are statistics, offsets, and path_in_schema. It takes ~10 steps
to cut the size down, each of which removes a good chunk of the original size.

On Thu, Aug 15, 2024 at 2:43 PM Jan Finis <jpfi...@gmail.com> wrote:

> I guess most closed-source implementations have done these optimizations
> already; it has just not been done in the open-source versions. E.g., we
> switched to a custom-built thrift runtime using pool allocators and string
> views instead of copied strings a few years ago, seeing comparable
> speed-ups. The C++ thrift library is just horribly inefficient.
>
> I agree with Alkis though that there are some gains that can be achieved by
> optimizing, but the format has inherent drawbacks. Flatbuffers is indeed
> more efficient but at the cost of increased size.
> Alkis, can you elaborate how you brought the size of Flatbuffers down?
>
> Cheers,
> Jan
>
> On Thu, Aug 15, 2024 at 13:50, Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > I don't disagree that flatbuffers would be faster than thrift decoding.
> >
> > I am trying to say that with software engineering only (no change to the
> > format) it is likely possible to increase parquet thrift metadata parsing
> > speed by 4x.
> >
> > This is not 25x of course, but 4x is non-trivial.
> >
> > The fact that no one has yet bothered to invest the time to get the 4x
> > in open source implementations of parquet suggests to me that the parsing
> > time may not be as critical an issue as we think.
> >
> > Andrew
> >
> > On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > > The difference in parsing speed between thrift and flatbuffers is >25x.
> > > > Thrift has some fundamental design decisions that make decoding slow:
> > > > 1. The thrift compact protocol is very data dependent: integers are
> > > > ULEB encoded and field ids are deltas from the previous field. These
> > > > data dependencies prevent modern CPUs from pipelining the decode.
> > > > 2. The object model has no way to use arenas to avoid many small
> > > > object allocations.
> > > > If we keep thrift, we can potentially get 2 fixed, but fixing 1
> > > > requires changes to the thrift serialization protocol. Such a change
> > > > is no different from switching serialization format.
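
To illustrate the data dependency described in point 1, here is a minimal Python sketch. It is not taken from this thread or from any thrift library, and all function names are hypothetical. It only shows the structure of the problem: with ULEB128 the start of each integer is known only after the previous one has been fully decoded, while with a fixed-width layout every offset is known up front. Python itself will not show the CPU pipelining effect.

```python
import struct


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128 (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one ULEB128 integer starting at buf[pos].

    Returns (value, next_pos). The length is discovered byte by byte.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode_varint_stream(buf: bytes, count: int) -> list[int]:
    """Decode `count` back-to-back ULEB128 integers.

    Each iteration needs `pos` from the previous one: a serial dependency.
    """
    values = []
    pos = 0
    for _ in range(count):
        value, pos = decode_uleb128(buf, pos)
        values.append(value)
    return values


def decode_fixed_stream(buf: bytes, count: int) -> list[int]:
    """Decode `count` little-endian uint32 values.

    Value i always lives at byte offset 4 * i, independent of the data.
    """
    return list(struct.unpack_from(f"<{count}I", buf, 0))


values = [1, 300, 70000]
varint_bytes = b"".join(encode_uleb128(v) for v in values)
fixed_bytes = struct.pack(f"<{len(values)}I", *values)
print(decode_varint_stream(varint_bytes, len(values)))
print(decode_fixed_stream(fixed_bytes, len(values)))
```

The fixed-width form is larger on disk for small values, which is the size versus speed trade-off discussed in this thread.
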
> > >
> > >
> > > On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <andrewlam...@gmail.com>
> > > wrote:
> > >
> > > > > I wanted to share some work Xiangpeng Hao did at InfluxData this
> > > > > summer on the current (thrift) metadata format[1].
> > > >
> > > > > We found that with careful software engineering, we could likely
> > > > > improve the speed of reading existing parquet footer format by a
> > > > > factor of 4 or more ([2] contains some specific ideas). While we
> > > > > analyzed the Rust implementation, I believe a similar conclusion
> > > > > applies to C/C++.
> > > >
> > > > > I realize that there are certain features that switching to an
> > > > > entirely new footer format would achieve, but the cost of adopting a
> > > > > new format across the ecosystem is immense (e.g. Parquet "version
> > > > > 2.0" etc).
> > > >
> > > > > It is my opinion that investing the same effort in software
> > > > > optimization that would be required for a new footer format would
> > > > > have a much bigger impact.
> > > >
> > > > Andrew
> > > >
> > > > [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > > > [2]: https://github.com/apache/arrow-rs/issues/5853
> > > >
> > > > On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos
> > > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > > >
> > > > > Hi Julien.
> > > > >
> > > > > Thank you for reconnecting the threads.
> > > > >
> > > > > > I have broken down my experiments into a narrative, commit by
> > > > > > commit, of how we can go from flatbuffers being ~2x larger than
> > > > > > thrift to being smaller than thrift (at times even half the size).
> > > > > > This is still on an internal branch; I will resume work towards
> > > > > > the end of this month to port it to arrow so that folks can look
> > > > > > at the progress and share ideas.
> > > > >
> > > > > > On the benchmarking front I need to build and share a binary for
> > > > > > third parties to donate their footers for analysis.
> > > > >
> > > > > > The PR for parquet extensions has gotten a few rounds of reviews:
> > > > > > https://github.com/apache/parquet-format/pull/254. I hope it will
> > > > > > be merged soon.
> > > > >
> > > > > I missed the sync yesterday - for some reason I didn't receive an
> > > > > invitation. Julien could you add me again to the invite list?
> > > > >
> > > > > > On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <jul...@apache.org>
> > > > > > wrote:
> > > > >
> > > > > > This came up in the sync today.
> > > > > >
> > > > > > > There are a few concurrent experiments with flatbuffers for a
> > > > > > > future Parquet footer replacement. That is fine in itself; I
> > > > > > > just wanted to reconnect the threads here so that folks are
> > > > > > > aware of each other and can share findings.
> > > > > >
> > > > > > > - Neelaksh benchmarking and experiments:
> > > > > > > https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> > > > > > > https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> > > > > >
> > > > > > > - Alkis has also been experimenting and led the proposal to
> > > > > > > enable extending the existing footer.
> > > > > > > https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> > > > > >
> > > > > > - Xuwei also shared that he is looking into this.
> > > > > >
> > > > > > > I would suggest that you all reply to this thread sharing your
> > > > > > > current progress or ideas and a link to your respective repos
> > > > > > > for experimenting.
> > > > > >
> > > > > > Best
> > > > > > Julien
> > > > > >
> > > > >
> > > >
> > >
> >
>
