Hello Julien. I finally got around to compiling binaries for the benchmarking repo. Can you add an empty README.md to https://github.com/apache/parquet-benchmark? Otherwise I can't fork it, because it is an empty repo (!!!).
Cheers,

On Wed, Aug 7, 2024 at 12:52 AM Julien Le Dem <jul...@apache.org> wrote:

> That works for me.
> @Alkis Evlogimenos <alkis.evlogime...@databricks.com> When you open a PR
> on parquet-benchmark, just make it clear how this binary got there and that
> it is an unofficial build from the arrow project waiting for an official
> release.
>
> On Tue, Aug 6, 2024 at 7:52 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
>> That would be a temporary solution until parquet-cpp is released? Seems ok
>> as it's a utility thing.
>>
>> On Tue, Aug 6, 2024 at 4:03 PM Alkis Evlogimenos
>> <alkis.evlogime...@databricks.com.invalid> wrote:
>>
>> > Perhaps it is best to compile static binaries of the above and upload to
>> > https://github.com/apache/parquet-benchmark along with a readme?
>> >
>> > On Tue, Aug 6, 2024 at 4:30 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>> >
>> > > Arrow releases are cut ~every three months and the last release was mid
>> > > July (https://arrow.apache.org/release/17.0.0.html).
>> > > I would speculate 18.0.0 will be public mid September.
>> > >
>> > > On Tue, Aug 6, 2024 at 3:20 PM Alkis Evlogimenos
>> > > <alkis.evlogime...@databricks.com.invalid> wrote:
>> > >
>> > > > Thank you Julien. When can we expect a new arrow package release so that I
>> > > > can compile a doc for customers to donate footers to us?
>> > > >
>> > > > binary in question:
>> > > >
>> > > > https://github.com/apache/arrow/blob/main/cpp/tools/parquet/parquet_dump_footer.cc
>> > > >
>> > > > On Sat, Aug 3, 2024 at 3:17 AM Julien Le Dem <jul...@apache.org> wrote:
>> > > >
>> > > > > Following up on my action item, I have created the parquet-benchmark repo:
>> > > > > https://github.com/apache/parquet-benchmark
>> > > > >
>> > > > > On Wed, Jul 31, 2024 at 3:46 PM Julien Le Dem <jul...@apache.org> wrote:
>> > > > >
>> > > > > > Attendees:
>> > > > > >
>> > > > > > - Micah: Google, no special topic today
>> > > > > > - Alkis: Databricks, storage stack. Topic: Parquet extension PR so that
>> > > > > >   we can go in the format. Want to fix the metadata to make it work for
>> > > > > >   wide schemas.
>> > > > > > - Vinoo: Palantir -> startup in data space. Working on improving the
>> > > > > >   website.
>> > > > > > - Julien: Datadog. Topic: Make parquet reading possible to be done
>> > > > > >   sequentially (as opposed to footer first)
>> > > > > > - Rok: Voltron -> freelance in Fintech. Care about Parquet performance.
>> > > > > >   Have time to contribute to footers (“V3”).
>> > > > > >
>> > > > > > Follow up items:
>> > > > > >
>> > > > > > Mika’s Parquet format changes process
>> > > > > >
>> > > > > > - First PR merged, need to finalize java
>> > > > > > - => Mostly done
>> > > > > >
>> > > > > > Jira -> github migration
>> > > > > >
>> > > > > > - Getting started with github. Will follow up on the mailing list.
>> > > > > > - => mostly closed discussion. Some follow up async on the discussion.
>> > > > > >
>> > > > > > Agenda:
>> > > > > >
>> > > > > > - Finalizing [EXTERNAL] Parquet extensions
>> > > > > >   <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
>> > > > > >   - AI: Alkis Evlogimenos <alkis.evlogime...@databricks.com> to update
>> > > > > >     PR with everything in the doc except Alternatives Considered and
>> > > > > >     split the examples in another page.
>> > > > > > - New footer metadata discussion.
>> > > > > >
>> > > > > > Discussion:
>> > > > > >
>> > > > > > - Extensions:
>> > > > > >   - Add functionality to read/write the extension and show that we can
>> > > > > >     ignore it.
>> > > > > >     - 1: write an extension and read the old footer that ignores it.
>> > > > > >     - 2: write extension and allow reading it back.
>> > > > > > - New metadata:
>> > > > > >   - Flatbuffer is bigger than thrift: need to optimize metadata
>> > > > > >   - Start from a 1-1 implementation to existing footer and keep
>> > > > > >     iterating 1 commit at a time.
>> > > > > >   - Would like to have a branch in github arrow cpp or a public fork on
>> > > > > >     github to share the prototype.
>> > > > > >   - Add to parquet-tool to print the footer.
>> > > > > >   - Add utility to obfuscate schema so that people can share their
>> > > > > >     metadata without sharing proprietary information.
>> > > > > >   - That way we can have data about slow footers and validate we can
>> > > > > >     read faster with the new footer
>> > > > > >   - => creation of a database of footers.
>> > > > > >   - Getting a feel of what features are used by users.
>> > > > > >   - Alkis would want to share his findings through a blog post.
>> > > > > >   - Also need to make sure the addition of the new footer doesn’t
>> > > > > >     impact old footers too much.
>> > > > > >   - Possibly:
>> > > > > >     - Codspeed for performance testing
>> > > > > >     - Thrift linter: https://github.com/thrift-labs/thrift-fmt
>> > > > > >   - AI:
>> > > > > >     - [Julien] Create a parquet-benchmark repo for a footer db and
>> > > > > >       other things
>> > > > > >       - Example: https://github.com/rok/parquet-benchmark
>> > > > > >     - Alkis to pick where on github to push his prototype branch
>> > > > > >   - Follow up on:
>> > > > > >     - https://github.com/apache/parquet-format/pull/445