Thank you Fokko. PR is up: https://github.com/apache/parquet-benchmark/pull/1
On Tue, Aug 20, 2024 at 12:11 AM Julien Le Dem <jul...@apache.org> wrote:

> Thanks Fokko!
>
> On Mon, Aug 19, 2024 at 11:59 AM Fokko Driesprong <fo...@apache.org> wrote:
>
> > Done!
> >
> > Kind regards,
> > Fokko
> >
> > On Mon, Aug 19, 2024 at 20:52 Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > Hello Julien. I finally got around to compiling binaries for the
> > > benchmarking repo. Can you add an empty README.md in
> > > https://github.com/apache/parquet-benchmark because otherwise I can't
> > > fork an empty repo (!!!).
> > >
> > > Cheers,
> > >
> > > On Wed, Aug 7, 2024 at 12:52 AM Julien Le Dem <jul...@apache.org> wrote:
> > >
> > > > That works for me.
> > > > @Alkis Evlogimenos <alkis.evlogime...@databricks.com> When you open a
> > > > PR on parquet-benchmark, just make it clear how this binary got there
> > > > and that it is an unofficial build from the arrow project waiting for
> > > > an official release.
> > > >
> > > > On Tue, Aug 6, 2024 at 7:52 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > >
> > > > > That would be a temporary solution until parquet-cpp is released?
> > > > > Seems ok as it's a utility thing.
> > > > >
> > > > > On Tue, Aug 6, 2024 at 4:03 PM Alkis Evlogimenos
> > > > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > > > >
> > > > > > Perhaps it is best to compile static binaries of the above and
> > > > > > upload to https://github.com/apache/parquet-benchmark along with
> > > > > > a readme?
> > > > > >
> > > > > > On Tue, Aug 6, 2024 at 4:30 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > > > >
> > > > > > > Arrow releases are cut ~every three months and the last release
> > > > > > > was mid July (https://arrow.apache.org/release/17.0.0.html).
> > > > > > > I would speculate 18.0.0 will be public mid September.
> > > > > > >
> > > > > > > On Tue, Aug 6, 2024 at 3:20 PM Alkis Evlogimenos
> > > > > > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > > > > > >
> > > > > > > > Thank you Julien. When can we expect a new arrow package release
> > > > > > > > so that I can compile a doc for customers to donate footers to us?
> > > > > > > >
> > > > > > > > binary in question:
> > > > > > > > https://github.com/apache/arrow/blob/main/cpp/tools/parquet/parquet_dump_footer.cc
> > > > > > > >
> > > > > > > > On Sat, Aug 3, 2024 at 3:17 AM Julien Le Dem <jul...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Following up on my action item, I have created the
> > > > > > > > > parquet-benchmark repo:
> > > > > > > > > https://github.com/apache/parquet-benchmark
> > > > > > > > >
> > > > > > > > > On Wed, Jul 31, 2024 at 3:46 PM Julien Le Dem <jul...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Attendees:
> > > > > > > > > >
> > > > > > > > > > - Micah: Google, no special topic today
> > > > > > > > > > - Alkis: Databricks, storage stack. Topic: Parquet extension PR
> > > > > > > > > >   so that we can go in the format. Want to fix the metadata to
> > > > > > > > > >   make it work for wide schemas.
> > > > > > > > > > - Vinoo: Palantir -> startup in data space. Working on improving
> > > > > > > > > >   the website.
> > > > > > > > > > - Julien: Datadog. Topic: Make parquet reading possible to be
> > > > > > > > > >   done sequentially (as opposed to footer first)
> > > > > > > > > > - Rok: Voltron -> freelance in Fintech.
> > > > > > > > > >   Care about Parquet performance. Have time to contribute to
> > > > > > > > > >   footers (“V3”).
> > > > > > > > > >
> > > > > > > > > > Follow up items:
> > > > > > > > > >
> > > > > > > > > > Micah’s Parquet format changes process
> > > > > > > > > >
> > > > > > > > > > - First PR merged, need to finalize java
> > > > > > > > > > - => Mostly done
> > > > > > > > > >
> > > > > > > > > > Jira -> github migration
> > > > > > > > > >
> > > > > > > > > > - Getting started with github. Will follow up on the mailing list.
> > > > > > > > > > - => mostly closed discussion. Some follow up async on the
> > > > > > > > > >   discussion.
> > > > > > > > > >
> > > > > > > > > > Agenda:
> > > > > > > > > >
> > > > > > > > > > - Finalizing [EXTERNAL] Parquet extensions
> > > > > > > > > >   <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
> > > > > > > > > >   - AI: Alkis Evlogimenos <alkis.evlogime...@databricks.com> to
> > > > > > > > > >     update PR with everything in the doc except Alternatives
> > > > > > > > > >     Considered and split the examples in another page.
> > > > > > > > > > - New footer metadata discussion.
> > > > > > > > > >
> > > > > > > > > > Discussion:
> > > > > > > > > >
> > > > > > > > > > - Extensions:
> > > > > > > > > >   - Add functionality to read/write the extension and show that
> > > > > > > > > >     we can ignore it.
> > > > > > > > > >     - 1: write an extension and read the old footer that ignores it.
> > > > > > > > > >     - 2: write extension and allow reading it back.
> > > > > > > > > > - New metadata:
> > > > > > > > > >   - Flatbuffer is bigger than thrift: need to optimize metadata
> > > > > > > > > >     - Start from a 1-1 implementation to existing footer and keep
> > > > > > > > > >       iterating 1 commit at a time.
> > > > > > > > > >   - Would like to have a branch in github arrow cpp or a public
> > > > > > > > > >     fork on github to share the prototype.
> > > > > > > > > >   - Add to parquet-tool to print the footer.
> > > > > > > > > >     - Add utility to obfuscate schema so that people can share
> > > > > > > > > >       their metadata without sharing proprietary information.
> > > > > > > > > >     - That way we can have data about slow footers and validate
> > > > > > > > > >       we can read faster with the new footer
> > > > > > > > > >     - => creation of a database of footers.
> > > > > > > > > >     - Getting a feel of what features are used by users.
> > > > > > > > > >     - Alkis would want to share his findings through a blog post.
> > > > > > > > > >   - Also need to make sure the addition of the new footer doesn’t
> > > > > > > > > >     impact old footers too much.
> > > > > > > > > >   - Possibly:
> > > > > > > > > >     - Codspeed for performance testing
> > > > > > > > > >     - Thrift linter: https://github.com/thrift-labs/thrift-fmt
> > > > > > > > > >   - AI:
> > > > > > > > > >     - [Julien] Create a parquet-benchmark repo for a footer db
> > > > > > > > > >       and other things
> > > > > > > > > >       - Example: https://github.com/rok/parquet-benchmark
> > > > > > > > > >     - Alkis to pick where on github to push his prototype branch
> > > > > > > > > >   - Follow up on:
> > > > > > > > > >     - https://github.com/apache/parquet-format/pull/445
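
For readers of the notes above, here is a rough sketch of what "obfuscate schema" could mean in practice. It is not taken from parquet_dump_footer or any other tool in this thread. The function names, the "col_" prefix, the 12-character token length, and the salt are all assumptions made up for illustration. The idea is to replace every component of a dotted column path with a keyed hash, so the shape of the schema survives while the names do not.

```python
import hashlib
import hmac


def obfuscate_name(name: str, salt: bytes) -> str:
    """Map one column-name component to a stable, hard-to-reverse token.

    Hypothetical helper: a keyed hash (HMAC-SHA256) so that the mapping
    cannot be rebuilt by someone who does not know the salt.
    """
    digest = hmac.new(salt, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "col_" + digest[:12]


def obfuscate_schema(column_paths, salt: bytes):
    """Obfuscate dotted column paths component by component.

    Components are hashed independently, so columns that share a parent
    group still share the same obfuscated parent name.
    """
    return [
        ".".join(obfuscate_name(part, salt) for part in path.split("."))
        for path in column_paths
    ]


if __name__ == "__main__":
    # Made-up column paths standing in for a proprietary schema.
    paths = ["customer.id", "customer.email", "order.total"]
    hidden_paths = obfuscate_schema(paths, salt=b"local-secret")
    for original, hidden in zip(paths, hidden_paths):
        print(original, "->", hidden)
```

Keeping the salt private to whoever donates the footer means the same file always maps to the same tokens, while names stay unlinkable across contributors. A real utility would presumably also need to scrub statistics and key-value metadata, which this sketch does not attempt.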