Re: [DISCUSS] new Parquet footer experiments

Alkis Evlogimenos Wed, 28 Aug 2024 02:26:28 -0700

Yes, once https://github.com/apache/parquet-benchmark/pull/1 is merged, I
plan to pull in a few of those footers so that they can seed the project
and at the same time show how donations should be made.


On Wed, Aug 28, 2024 at 10:25 AM Antoine Pitrou <anto...@python.org> wrote:

>
> I suppose you already know this, but you can use public datasets as a
> source of real-world Parquet footers.
>
> For example, the GeoParquet website lists a couple of data providers:
> https://geoparquet.org/
>
> Regards
>
> Antoine.
>
>
> On Sun, 18 Aug 2024 14:20:28 +0200
> Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.INVALID>
> wrote:
> > The biggest thing about benchmarks is the data itself. This is why I added
> > this binary https://github.com/apache/arrow/pull/42174 to help users donate
> > their slow-to-parse footers for benchmarking purposes.
> >
> > On Sat, Aug 17, 2024 at 11:43 AM Neelaksh Singh <neelaks...@gmail.com>
> > wrote:
> >
> > > One more point that I would like to mention here is that we have put a lot
> > > of effort into REPRODUCIBILITY for this benchmark repo. There have been a
> > > lot of great benchmarking efforts as part of this discussion. However, one
> > > limitation is that many of the experiments have not included code or take
> > > a fair bit of effort to set up. We've made strong efforts here using Docker
> > > and vcpkg to make the setup for these benchmarks as transparent and
> > > reproducible as possible. Our hope is that this will provide a useful
> > > contribution for others to either reproduce many of the results that have
> > > been discussed or easily run their own experiments when trying
> > > alternatives. We hope this will help facilitate the discussion with easily
> > > shareable experiments.
> > >
> > > On Thu, Aug 15, 2024, 9:21 PM Alkis Evlogimenos
> > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > >
> > > > > Alkis, can you elaborate how you brought the size of Flatbuffers down?
> > > >
> > > > I have the internal PR rewritten in separate commits with all the steps. I
> > > > plan to publish it to the arrow repo as soon as possible. The heavy things in
> > > > the metadata are statistics, offsets, and path_in_schema. It takes ~10 steps
> > > > to cut the size down, each of which removes a good chunk of the original size.
> > > >
> > > > On Thu, Aug 15, 2024 at 2:43 PM Jan Finis <jpfinis-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > >
> > > > > I guess most closed-source implementations have done these optimizations
> > > > > already; it has just not been done in the open source versions. E.g., we
> > > > > switched to a custom-built thrift runtime using pool allocators and string
> > > > > views instead of copied strings a few years ago, seeing comparable
> > > > > speed-ups. The C++ thrift library is just horribly inefficient.
> > > > >
> > > > > I agree with Alkis though that there are some gains that can be achieved by
> > > > > optimizing, but the format has inherent drawbacks. Flatbuffers is indeed
> > > > > more efficient but at the cost of increased size.
> > > > > Alkis, can you elaborate how you brought the size of Flatbuffers down?
> > > > >
> > > > > Cheers,
> > > > > Jan
> > > > >
> > > > > On Thu, Aug 15, 2024 at 1:50 PM Andrew Lamb <
> > > > > andrewlam...@gmail.com> wrote:
> > > > >
> > > > > > I don't disagree that flatbuffers would be faster than thrift decoding.
> > > > > >
> > > > > > I am trying to say that with software engineering only (no change to the
> > > > > > format) it is likely possible to increase parquet thrift metadata parsing
> > > > > > speed by 4x.
> > > > > >
> > > > > > This is not 25x of course, but 4x is non-trivial.
> > > > > >
> > > > > > The fact that no one has yet bothered to invest the time to get the 4x in
> > > > > > open source implementations of parquet suggests to me that the parsing
> > > > > > time may not be as critical an issue as we think.
> > > > > >
> > > > > > Andrew
> > > > > >
> > > > > > On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos
> > > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl...@public.gmane.org> wrote:
> > > > > >
> > > > > > > The difference in parsing speed between thrift and flatbuffers is >25x.
> > > > > > > Thrift has some fundamental design decisions that make decoding slow:
> > > > > > > 1. The thrift compact protocol is very data dependent: uleb encoding for
> > > > > > > integers, field ids are deltas from the previous field. These data
> > > > > > > dependencies prevent modern CPUs from pipelining the decode.
> > > > > > > 2. The object model does not have a way to use arenas to avoid many
> > > > > > > allocations of objects.
> > > > > > > If we keep thrift, we can potentially get 2 fixed, but fixing 1 requires
> > > > > > > changes to the thrift serialization protocol. Such a change is no
> > > > > > > different from switching serialization format.
> > > > > > >
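To make point 1 above concrete, here is a minimal Python sketch of the two
mechanisms it names: ULEB128 (varint) integers and delta-encoded field ids, as
used by the Thrift compact protocol. This is an illustration only, not code
from any Parquet or Thrift implementation. The function names
(decode_uleb128, zigzag_decode, read_int_fields) are hypothetical, and the
struct reader assumes every field is a zigzag varint integer, which is a
simplification that does not hold for real footers. The thing to notice is
that every read needs the position produced by the previous read, and every
short-form field id needs the previous field id.

```python
from typing import Dict, Tuple


def decode_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned LEB128 varint starting at buf[pos].

    Returns (value, next_pos). The length of the integer is only known
    after inspecting the continuation bit of each byte in turn.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Map an unsigned zigzag value back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


def read_int_fields(buf: bytes) -> Dict[int, int]:
    """Read a struct whose fields are all varint integers (assumption).

    Field header byte: high 4 bits = field id delta, low 4 bits = type.
    A delta of 0 means the field id follows as a zigzag varint.
    A 0x00 byte terminates the struct.
    """
    fields: Dict[int, int] = {}
    pos = 0
    field_id = 0
    while buf[pos] != 0x00:
        header = buf[pos]
        pos += 1
        delta = header >> 4  # the type nibble is ignored in this sketch
        if delta == 0:
            raw, pos = decode_uleb128(buf, pos)
            field_id = zigzag_decode(raw)
        else:
            field_id += delta  # depends on the previously decoded field id
        raw, pos = decode_uleb128(buf, pos)  # depends on the previous pos
        fields[field_id] = zigzag_decode(raw)
    return fields


if __name__ == "__main__":
    # Two i32 fields: id 1 with value 1, id 3 with value 150.
    print(read_int_fields(bytes([0x15, 0x02, 0x25, 0xAC, 0x02, 0x00])))
```

Run as a script, this prints {1: 1, 3: 150}. Nothing in the loop can start
decoding field N+1 before field N has been fully decoded, which is the serial
dependency chain that point 1 describes.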
> > > > > > >
> > > > > > > On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <andrewlam...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I wanted to share some work Xiangpeng Hao did at InfluxData this summer
> > > > > > > > on the current (thrift) metadata format[1].
> > > > > > > >
> > > > > > > > We found that with careful software engineering, we could likely improve
> > > > > > > > the speed of reading the existing parquet footer format by a factor of 4
> > > > > > > > or more ([2] contains some specific ideas). While we analyzed the
> > > > > > > > Rust implementation, I believe a similar conclusion applies to C/C++.
> > > > > > > >
> > > > > > > > I realize that there are certain features that switching to an entirely
> > > > > > > > new footer format would achieve, but the cost of adopting a new format
> > > > > > > > across the ecosystem is immense (e.g. Parquet "version 2.0" etc).
> > > > > > > >
> > > > > > > > It is my opinion that investing the same effort in software optimization
> > > > > > > > that would be required for a new footer format would have a much bigger
> > > > > > > > impact.
> > > > > > > >
> > > > > > > > Andrew
> > > > > > > >
> > > > > > > > [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> > > > > > > > [2]: https://github.com/apache/arrow-rs/issues/5853
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos
> > > > > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl...@public.gmane.org> wrote:
> > > > > > > >
> > > > > > > > > Hi Julien.
> > > > > > > > >
> > > > > > > > > Thank you for reconnecting the threads.
> > > > > > > > >
> > > > > > > > > I have broken down my experiments into a commit-by-commit narrative of
> > > > > > > > > how we can go from flatbuffers being ~2x larger than thrift to being
> > > > > > > > > smaller than thrift (at times even half the size). This is still on an
> > > > > > > > > internal branch; I will resume work towards the end of this month to
> > > > > > > > > port it to arrow so that folks can look at the progress and share ideas.
> > > > > > > > >
> > > > > > > > > On the benchmarking front I need to build and share a binary for third
> > > > > > > > > parties to donate their footers for analysis.
> > > > > > > > >
> > > > > > > > > The PR for parquet extensions has gotten a few rounds of reviews:
> > > > > > > > > https://github.com/apache/parquet-format/pull/254. I hope it will be
> > > > > > > > > merged soon.
> > > > > > > > >
> > > > > > > > > I missed the sync yesterday - for some reason I didn't receive an
> > > > > > > > > invitation. Julien, could you add me again to the invite list?
> > > > > > > > >
> > > > > > > > > On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <jul...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > This came up in the sync today.
> > > > > > > > > >
> > > > > > > > > > There are a few concurrent experiments with flatbuffers for a future
> > > > > > > > > > Parquet footer replacement. That in itself is fine; I just wanted to
> > > > > > > > > > reconnect the threads here so that folks are aware of each other and
> > > > > > > > > > can share findings.
> > > > > > > > > >
> > > > > > > > > > - Neelaksh benchmarking and experiments:
> > > > > > > > > >
> > > > > > > > > > https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> > > > > > > > > >
> > > > > > > > > > https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> > > > > > > > > >
> > > > > > > > > > - Alkis has also been experimenting and led the proposal for enabling
> > > > > > > > > > extensions to the existing footer.
> > > > > > > > > >
> > > > > > > > > > https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> > > > > > > > > >
> > > > > > > > > > - Xuwei also shared that he is looking into this.
> > > > > > > > > >
> > > > > > > > > > I would suggest that you all reply to this thread sharing your current
> > > > > > > > > > progress or ideas and a link to your respective repos for
> > > > > > > > > > experimenting.
> > > > > > > > > >
> > > > > > > > > > Best
> > > > > > > > > > Julien
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
>
