Amazing, thanks Alkis!

Can you comment briefly on what specifically made the footers so much
smaller in their flatbuf representation? Given that flatbuf compresses far
less aggressively than thrift, this seems counterintuitive; I would have
expected the footers to grow instead.

Cheers,
Jan

On Fri, Aug 23, 2024 at 03:41, Corwin Joy <corwin...@gmail.com> wrote:

> This looks great! I have added some initial simple comments on the PR that
> may help others who want to take a look.
>
> On Thu, Aug 22, 2024 at 5:46 PM Julien Le Dem <jul...@apache.org> wrote:
>
> > this looks great,
> > thank you for sharing.
> >
> >
> > On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > Hey folks.
> > >
> > > As promised I pushed a PR to the main repo with my attempt to use
> > > flatbuffers for metadata for parquet:
> > > https://github.com/apache/arrow/pull/43793
> > >
> > > The PR builds on top of the metadata extensions in parquet
> > > https://github.com/apache/parquet-format/pull/254 and tests how fast we
> > > can parse thrift, thrift+flatbuf, flatbuf alone and also how much time it
> > > takes to encode flatbuf. In addition at the start of the benchmark it
> > > prints out the number of row groups/column chunks and thrift/flatbuffer
> > > serialized bytes.
> > >
> > > I structured the commits to contain one optimization each to make their
> > > effects more visible. I have tracked the progress at the top of the
> > > benchmark
> > > <https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129>.
> > >
> > > The current state is complete sans encryption support. All the bugs are
> > > mine but ideas are coming from a few folks inside Databricks. As expected
> > > parsing the thrift+extension footer incurs a very small regression (~1%).
> > > Parsing/verifying flatbuffers is >20x faster than thrift so I haven't
> > > tried to make changes to its structure for speed. In the last commit the
> > > size of flatbuffer metadata is anywhere from slightly smaller to more
> > > than 4x smaller (!!!).
> > >
> > > Unfortunately I can't share the footers I used yet. I am going to wait
> > > for donations <https://github.com/apache/parquet-benchmark/pull/1> to the
> > > parquet-benchmarks repository and rerun the benchmark against them.
> > >
> > > I would like to invite anyone interested in collaborating to take a look
> > > at the PR, consider the design decisions made, experiment with it, and
> > > contribute.
> > >
> > > Thank you!
> > >
> >
>
