Yes, the gains are substantial. This is one of the biggest optimizations.

The footer shrinks by between 25% and 75% (the latter being a 4x
reduction), depending on how much other data it carries. Footers without
stats get about 4x smaller; footers with stats get 2x smaller.
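
To make the offset part concrete, here is a minimal sketch of the idea in
Python. It is not the code from the PR: the helper names and the signed
32-bit limit are assumptions for illustration. Column chunk offsets are
stored relative to the row group offset, so they fit in 32 bits as long as
the row group itself stays under the limit:

```python
INT32_MAX = 2**31 - 1  # assumed limit: largest signed 32-bit value


def to_relative_offsets(row_group_offset, row_group_bytes, chunk_offsets):
    """Convert absolute column chunk offsets to row-group-relative ones.

    Hypothetical helper. Raises ValueError when the row group is too large,
    which is the case a writer avoids by starting a new row group earlier.
    """
    if row_group_bytes > INT32_MAX:
        raise ValueError("row group too large for 32-bit relative offsets")
    relative = []
    for absolute in chunk_offsets:
        rel = absolute - row_group_offset
        if not 0 <= rel < row_group_bytes:
            raise ValueError("column chunk starts outside its row group")
        relative.append(rel)
    return relative


def to_absolute_offsets(row_group_offset, relative_offsets):
    """Recover absolute file offsets from the relative ones."""
    return [row_group_offset + rel for rel in relative_offsets]


# Example: a row group that starts beyond the 4 GiB mark of the file.
row_group_offset = 5 * 2**32
absolute = [row_group_offset + 4, row_group_offset + 1_000_000]
relative = to_relative_offsets(row_group_offset, 2_000_000, absolute)
assert to_absolute_offsets(row_group_offset, relative) == absolute
```

The size check in the writer is the dedicated chunking logic Antoine
mentions; the payoff is that every offset in the footer becomes a small
32-bit number instead of a 64-bit absolute position.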

On Wed, Aug 28, 2024 at 10:32 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Do you gain much from limiting row groups to 2^31 values and bytes? I
> generally find 32-bit lengths to be a bit of an anti-pattern, as they
> require dedicated logic in the writer to ensure sufficient chunking.
>
> Regards
>
> Antoine.
>
>
> On Mon, 26 Aug 2024 10:35:38 +0200
> Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.INVALID>
> wrote:
> > At the top of the benchmark code I have numbers and short description of
> > each optimization:
> >
> > https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129
> >
> > Summarizing them here:
> > - statistics min/max: use fixed 4/8 bytes for types of known length and
> > leave variable-length encoding for binary strings
> > - limit row groups to 2^31 values and 2^31 bytes and make all column
> > chunk offsets relative to the row group offset
> > - skip writing `num_values` in column chunk if it is the same as
> > `num_values` in row group (this is very common in practice)
> > - remove `encoding_stats` and replace with a boolean denoting whether
> > all pages are dict encoded (engines use this to do dictId only
> > execution)
> > - remove `path_in_schema` (can be computed dynamically after parsing as
> > necessary)
> > - remove deprecated `file_offset` in column chunk
> > - statistics min/max for strings: encode as common prefix + fixed 8 bytes
> > for min/max, zero padded
> >
> > Cheers,
> >
> > On Fri, Aug 23, 2024 at 6:59 PM Jan Finis <jpfinis-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >
> > > Amazing, thanks Alkis!
> > >
> > > Can you give a quick comment on what specific fact made the footers so
> > > much smaller in their flatbuf representation? Given that flatbuf
> > > compresses way less aggressively than thrift, this seems
> > > counterintuitive, I would have rather expected quite some size gain.
> > >
> > > Cheers,
> > > Jan
> > >
> > > On Fri, Aug 23, 2024 at 03:41, Corwin Joy <corwinjoy-re5jqeeq...@public.gmane.orgm> wrote:
> > >
> > > > This looks great! I have added some initial simple comments on the PR
> > > > that may help others who want to take a look.
> > > >
> > > > On Thu, Aug 22, 2024 at 5:46 PM Julien Le Dem <julien-1odqgaof3llqfi55v6+...@public.gmane.orgg> wrote:
> > > >
> > > > > this looks great,
> > > > > thank you for sharing.
> > > > >
> > > > >
> > > > > On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl...@public.gmane.org> wrote:
> > > > >
> > > > > > Hey folks.
> > > > > >
> > > > > > As promised I pushed a PR to the main repo with my attempt to use
> > > > > > flatbuffers for metadata for parquet:
> > > > > > https://github.com/apache/arrow/pull/43793
> > > > > >
> > > > > > The PR builds on top of the metadata extensions in parquet
> > > > > > https://github.com/apache/parquet-format/pull/254 and tests how
> > > > > > fast we can parse thrift, thrift+flatbuf, flatbuf alone and also
> > > > > > how much time it takes to encode flatbuf. In addition at the start
> > > > > > of the benchmark it prints out the number of row groups/column
> > > > > > chunks and thrift/flatbuffer serialized bytes.
> > > > > >
> > > > > > I structured the commits to contain one optimization each to make
> > > > > > their effects more visible. I have tracked the progress at the top
> > > > > > of the benchmark
> > > > > > <https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129>.
> > > > > >
> > > > > > The current state is complete sans encryption support. All the
> > > > > > bugs are mine but ideas are coming from a few folks inside
> > > > > > Databricks. As expected parsing the thrift+extension footer incurs
> > > > > > a very small regression (~1%). Parsing/verifying flatbuffers is
> > > > > > >20x faster than thrift so I haven't tried to make changes to its
> > > > > > structure for speed. In the last commit the size of flatbuffer
> > > > > > metadata is anywhere from slightly smaller to more than 4x smaller
> > > > > > (!!!).
> > > > > >
> > > > > > Unfortunately I can't share the footers I used yet. I am going to
> > > > > > wait for donations
> > > > > > <https://github.com/apache/parquet-benchmark/pull/1> to the
> > > > > > parquet-benchmarks repository and rerun the benchmark against
> > > > > > them.
> > > > > >
> > > > > > I would like to invite anyone interested in collaborating to take
> > > > > > a look at the PR, consider the design decisions made, experiment
> > > > > > with it, and contribute.
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
>
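
For the `num_values` item in the summary quoted above, the idea can be
sketched roughly as follows. This is only an illustration in Python, not
the flatbuffer code from the PR: a plain dict stands in for the serialized
column chunk, and both function names are hypothetical.

```python
def encode_column_chunk(chunk_num_values, row_group_num_values):
    """Build the stand-in column chunk metadata (hypothetical helper).

    `num_values` is written only when it differs from the row group value,
    so the common case costs no bytes at all.
    """
    encoded = {}
    if chunk_num_values != row_group_num_values:
        encoded["num_values"] = chunk_num_values
    return encoded


def decode_num_values(encoded, row_group_num_values):
    """Read `num_values`, falling back to the row group value if absent."""
    return encoded.get("num_values", row_group_num_values)


# Common case: the field is omitted and recovered from the row group.
common = encode_column_chunk(100, 100)
assert "num_values" not in common
assert decode_num_values(common, 100) == 100

# Less common case: the value differs, so it is written explicitly.
differing = encode_column_chunk(250, 100)
assert decode_num_values(differing, 100) == 250
```

The reader needs the row group value at hand when decoding a column chunk,
which it already has because column chunks are nested inside row groups.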
