Amazing, thanks Alkis! Could you comment briefly on what specifically made the footers so much smaller in their flatbuf representation? Given that flatbuf encodes far less compactly than thrift, this seems counterintuitive; I would have expected a noticeable size increase instead.
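To make the intuition behind that question concrete, here is a minimal sketch, not taken from the PR, of why one might expect the flatbuf footer to be larger. The Thrift compact protocol writes integers as zigzag varints, while FlatBuffers stores scalars at fixed width. The function names and sample values are purely illustrative assumptions, and the sketch says nothing about why the footers in the benchmark ended up smaller.

```python
import struct


def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def varint_size(n: int) -> int:
    """Bytes needed for n as a zigzag varint (the style used by Thrift compact)."""
    u = zigzag(n)
    size = 1
    while u >= 0x80:  # each byte carries 7 payload bits
        u >>= 7
        size += 1
    return size


def fixed_size(n: int) -> int:
    """Bytes needed for n as a fixed-width little-endian int64 (FlatBuffers-style scalar)."""
    return len(struct.pack("<q", n))


if __name__ == "__main__":
    # Hypothetical sample values; real footer fields will differ.
    for value in [0, 1, -1, 300, 1_000_000, 2**40]:
        print(f"{value:>15}: varint={varint_size(value)} B, fixed={fixed_size(value)} B")
```

For small values the varint form is much shorter than the fixed-width form, which is why a size reduction from switching to flatbuf looks surprising at first sight.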
Cheers,
Jan

On Fri, Aug 23, 2024 at 03:41, Corwin Joy <corwin...@gmail.com> wrote:

> This looks great! I have added some initial simple comments on the PR that
> may help others who want to take a look.
>
> On Thu, Aug 22, 2024 at 5:46 PM Julien Le Dem <jul...@apache.org> wrote:
>
> > this looks great,
> > thank you for sharing.
> >
> > On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > Hey folks.
> > >
> > > As promised I pushed a PR to the main repo with my attempt to use
> > > flatbuffers for metadata for parquet:
> > > https://github.com/apache/arrow/pull/43793
> > >
> > > The PR builds on top of the metadata extensions in parquet
> > > https://github.com/apache/parquet-format/pull/254 and tests how fast we can
> > > parse thrift, thrift+flatbuf, flatbuf alone and also how much time it takes
> > > to encode flatbuf. In addition at the start of the benchmark it prints out
> > > the number of row groups/column chunks and thrift/flatbuffer serialized
> > > bytes.
> > >
> > > I structured the commits to contain one optimization each to make their
> > > effects more visible. I have tracked the progress at the top of the
> > > benchmark
> > > <https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129>.
> > >
> > > The current state is complete sans encryption support. All the bugs are
> > > mine but ideas are coming from a few folks inside Databricks. As expected
> > > parsing the thrift+extension footer incurs a very small regression (~1%).
> > > Parsing/verifying flatbuffers is >20x faster than thrift so I haven't tried
> > > to make changes to its structure for speed. In the last commit the size of
> > > flatbuffer metadata is anywhere from slightly smaller to more than 4x
> > > smaller (!!!).
> > >
> > > Unfortunately I can't share the footers I used yet. I am going to wait for
> > > donations <https://github.com/apache/parquet-benchmark/pull/1> to the
> > > parquet-benchmarks repository and rerun the benchmark against them.
> > >
> > > I would like to invite anyone interested in collaborating to take a look at
> > > the PR, consider the design decisions made, experiment with it, and
> > > contribute.
> > >
> > > Thank you!