Attendees:
- Micah: Google, no special topic today
- Alkis: Databricks, storage stack. Topic: Parquet extension PR, so that it can go into the format. Wants to fix the metadata to make it work for wide schemas.
- Vinoo: Palantir -> startup in data space. Working on improving the website.
- Julien: Datadog. Topic: make it possible to read Parquet sequentially (as opposed to footer first).
- Rok: Voltron -> freelance in Fintech. Cares about Parquet performance. Has time to contribute to footers (“V3”).

Follow up items:
- Micah’s Parquet format changes process
  - First PR merged, need to finalize java
  - => Mostly done
- Jira -> github migration
  - Getting started with github. Will follow up on the mailing list.
  - => Discussion mostly closed. Some async follow-up on the discussion.

Agenda:
- Finalizing [EXTERNAL] Parquet extensions <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
  - AI: Alkis Evlogimenos <alkis.evlogime...@databricks.com> to update the PR with everything in the doc except Alternatives Considered, and to split the examples into another page.
- New footer metadata discussion.

Discussion:
- Extensions:
  - Add functionality to read/write the extension and show that we can ignore it.
    - 1: write an extension, then read the footer with the old reader, which ignores it.
    - 2: write an extension and allow reading it back.
- New metadata:
  - Flatbuffer is bigger than thrift: need to optimize metadata
  - Start from a 1-1 implementation of the existing footer and keep iterating 1 commit at a time.
  - Would like to have a branch in github arrow cpp or a public fork on github to share the prototype.
  - Add footer printing to parquet-tool.
  - Add a utility to obfuscate the schema so that people can share their metadata without sharing proprietary information.
    - That way we can have data about slow footers and validate that we can read faster with the new footer
    - => creation of a database of footers.
    - Getting a feel for which features users actually use.
  - Alkis would like to share his findings through a blog post.
  - Also need to make sure the addition of the new footer doesn’t impact old footers too much.
  - Possibly:
    - Codspeed for performance testing
    - Thrift linter: https://github.com/thrift-labs/thrift-fmt
- AI:
  - [Julien] Create a parquet-benchmark repo for a footer db and other things
    - Example: https://github.com/rok/parquet-benchmark
  - Alkis to pick where on github to push his prototype branch
- Follow up on:
  - https://github.com/apache/parquet-format/pull/445
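
Sketch for the schema obfuscation utility discussed under New metadata (illustration only, not something agreed in the meeting): replace every field name with a keyed hash while keeping nesting, types and repetition intact, so the shape of a slow footer can be shared without its names. The nested-dict schema layout and the names `obfuscate_name` / `obfuscate_schema` are assumptions made for this example; a real tool would operate on the Thrift footer and would likely also need to scrub column statistics and key/value metadata.

```python
import hashlib
import hmac


def obfuscate_name(name: str, key: bytes, length: int = 12) -> str:
    """Map a field name to a stable, non-reversible token using a keyed hash."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "f_" + digest[:length]


def obfuscate_schema(node: dict, key: bytes) -> dict:
    """Return a copy of a (hypothetical) nested schema dict with names hashed.

    Everything except "name" and "children" is copied through unchanged, so
    the tree shape, types and repetition levels are preserved.
    """
    out = {k: v for k, v in node.items() if k not in ("name", "children")}
    out["name"] = obfuscate_name(node["name"], key)
    if "children" in node:
        out["children"] = [obfuscate_schema(c, key) for c in node["children"]]
    return out


# Hypothetical example schema; field names are made up for illustration.
schema = {
    "name": "root",
    "children": [
        {"name": "customer_id", "type": "INT64", "repetition": "REQUIRED"},
        {
            "name": "address",
            "repetition": "OPTIONAL",
            "children": [
                {"name": "city", "type": "BYTE_ARRAY", "repetition": "OPTIONAL"},
            ],
        },
    ],
}

# The key stays with the person sharing the footer and is never published.
anonymized = obfuscate_schema(schema, key=b"local-secret")
```

A keyed hash (rather than a plain one) is used here so that common column names cannot be recovered by hashing a dictionary of guesses; the same key gives the same tokens, which keeps repeated names comparable across files from one contributor.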