Yes the gains are substantial. This is one of the biggest optimizations. They are between 25% to 75% (4x reduction) depending on how much other stuff the footer has. Footers without stats get about 4x smaller. With stats they are 2x smaller.
On Wed, Aug 28, 2024 at 10:32 AM Antoine Pitrou <anto...@python.org> wrote: > > Do you gain much from limiting row groups to 2^31 values and bytes? I > generally find 32-bit lengths to a bit an anti-pattern, as they require > dedicated logic in the writer to ensure sufficient chunking. > > Regards > > Antoine. > > > On Mon, 26 Aug 2024 10:35:38 +0200 > Alkis Evlogimenos > <alkis.evlogime...@databricks.com.INVALID> > wrote: > > At the top of the benchmark code I have numbers and short description of > > each optimization: > > > https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129 > > > > Summarizing them here: > > - statistics min/max: use fixed 4/8 bytes for types of known length and > > leave variable length for encoding for binary strings > > - limit row groups to 2^31 values and 2^31 bytes and make all column > chunk > > offsets relative to the row group offset > > - skip writing `num_values` in column chunk if it is the same as > > `num_values` in row group (this is very common in practice) > > - remove `encoding_stats` and replace with a boolean denoting that all > > pages are dict encoded or not (engines use this to do dictId only > execution) > > - remove `path_in_schema` (can be computed dynamically after parsing as > > necessary) > > - remove deprecated `file_offset` in column chunk > > - statistics min/max for strings: encode as common prefix + fixed 8 bytes > > for min/max, zero padded > > > > Cheers, > > > > On Fri, Aug 23, 2024 at 6:59 PM Jan Finis < > jpfinis-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote: > > > > > Amazing, thanks Alkis! > > > > > > Can you give a quick comment on what specific fact made the footers so > much > > > smaller in their flatbuf representation? Given that flatbuf compresses > way > > > less aggressively than thrift, this seems counterintuitive, I would > have > > > rather expected quite some size gain. > > > > > > Cheers, > > > Jan > > > > > > Am Fr., 23. Aug. 2024 um 03:41 Uhr schrieb Corwin Joy > <corwinjoy-re5jqeeq...@public.gmane.orgm > > > >: > > > > > > > This looks great! I have added some initial simple comments on the > PR > > > that > > > > may help others who want to take a look. > > > > > > > > On Thu, Aug 22, 2024 at 5:46 PM Julien Le Dem > <julien-1odqgaof3llqfi55v6+...@public.gmane.orgg> wrote: > > > > > > > > > this looks great, > > > > > thank you for sharing. > > > > > > > > > > > > > > > On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos > > > > > < > alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl...@public.gmane.org> > wrote: > > > > > > > > > > > Hey folks. > > > > > > > > > > > > As promised I pushed a PR to the main repo with my attempt to use > > > > > > flatbuffers for metadata for parquet: > > > > > > https://github.com/apache/arrow/pull/43793 > > > > > > > > > > > > The PR builds on top of the metadata extensions in parquet > > > > > > https://github.com/apache/parquet-format/pull/254 and tests how > fast > > > > we > > > > > > can > > > > > > parse thrift, thrift+flatbuf, flatbuf alone and also how much > time it > > > > > takes > > > > > > to encode flatbuf. In addition at the start of the benchmark it > > > prints > > > > > out > > > > > > the number of row groups/column chunks and thrift/flatbuffer > > > serialized > > > > > > bytes. > > > > > > > > > > > > I structured the commits to contain one optimization each to > make > > > their > > > > > > effects more visible. I have tracked the progress at the top of > the > > > > > > benchmark > > > > > > < > > > > > > > > > > > > > > > > > > > https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129 > > > > > > > > > > > > > > . > > > > > > > > > > > > The current state is complete sans encryption support. All the > bugs > > > are > > > > > > mine but ideas are coming from a few folks inside Databricks. > As > > > > expected > > > > > > parsing the thrift+extension footer incurs a very small > regression > > > > (~1%). > > > > > > Parsing/verifying flatbuffers is >20x faster than thrift so I > haven't > > > > > tried > > > > > > to make changes to its structure for speed. In the last commit > the > > > size > > > > > of > > > > > > flatbuffer metadata is anywhere from slightly smaller to more > than 4x > > > > > > smaller (!!!). > > > > > > > > > > > > Unfortunately I can't share the footers I used yet. I am going > to > > > wait > > > > > for > > > > > > donations <https://github.com/apache/parquet-benchmark/pull/1> > to > > > the > > > > > > parquet-benchmarks repository and rerun the benchmark against > them. > > > > > > > > > > > > I would like to invite anyone interested in collaborating to > take a > > > > look > > > > > at > > > > > > the PR, consider the design decisions made, experiment with it, > and > > > > > > contribute. > > > > > > > > > > > > Thank you! > > > > > > > > > > > > > > > > > > > > > > > >