Attendees:

   - Micah: Google, no special topic today
   - Alkis: Databricks, storage stack. Topic: Parquet extension PR, so that
     it can go into the format. Wants to fix the metadata to make it work
     for wide schemas.
   - Vinoo: Palantir -> startup in the data space. Working on improving the
     website.
   - Julien: Datadog. Topic: make it possible to read Parquet sequentially
     (as opposed to footer first).
   - Rok: Voltron -> freelance in fintech. Cares about Parquet performance.
     Has time to contribute to footers (“V3”).


Follow up items:

Micah’s Parquet format changes process

   - First PR merged, need to finalize Java
   - => Mostly done

Jira -> GitHub migration

   - Getting started with GitHub. Will follow up on the mailing list.
   - => Discussion mostly closed. Some follow-up will happen asynchronously.


Agenda:

   - Finalizing [EXTERNAL] Parquet extensions
     <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
      - AI: Alkis Evlogimenos <alkis.evlogime...@databricks.com> to update
        the PR with everything in the doc except Alternatives Considered,
        and split the examples into another page.
   - New footer metadata discussion.
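
For context on the footer items: a Parquet file with a plaintext footer ends with the serialized file metadata, followed by a 4-byte little-endian footer length and the magic bytes PAR1, which is why readers currently start from the end of the file. A minimal sketch of locating the footer, where the function name and the toy byte layout are assumptions made for illustration and the footer content is a placeholder rather than real Thrift:

```python
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of a plaintext-footer Parquet file


def footer_range(data):
    """Return (start, end) byte offsets of the serialized footer metadata."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    # The 4 bytes just before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# Toy file for illustration only: placeholder bytes stand in for pages and footer.
fake_footer = b"\x00" * 10
blob = MAGIC + b"pages" + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
start, end = footer_range(blob)
```

Because the length sits at the tail, a reader needs the end of the file before it can interpret anything else, which is the constraint behind the sequential-reading topic.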


Discussion:

   - Extensions:
      - Add functionality to read/write the extension and show that we can
        ignore it.
         - 1: write an extension and read it with the old footer reader,
           which ignores it.
         - 2: write an extension and allow reading it back.
   - New metadata:
      - FlatBuffers is bigger than Thrift: need to optimize the metadata
         - Start from a 1-1 implementation of the existing footer and keep
           iterating one commit at a time.
      - Would like to have a branch in GitHub Arrow C++ or a public fork on
        GitHub to share the prototype.
      - Add to parquet-tool to print the footer.
         - Add a utility to obfuscate the schema so that people can share
           their metadata without sharing proprietary information.
         - That way we can have data about slow footers and validate that
           we can read faster with the new footer.
         - => Creation of a database of footers.
      - Getting a feel for which features are used by users.
         - Alkis would like to share his findings through a blog post.
      - Also need to make sure the addition of the new footer doesn’t impact
        old footers too much.
      - Possibly:
         - CodSpeed for performance testing
         - Thrift linter: https://github.com/thrift-labs/thrift-fmt
      - AI:
         - [Julien] Create a parquet-benchmark repo for a footer db and other
           things
            - Example: https://github.com/rok/parquet-benchmark
         - Alkis to pick where on GitHub to push his prototype branch
         - Follow up on:
            - https://github.com/apache/parquet-format/pull/445
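
As an illustration of the schema-obfuscation idea noted above, here is a minimal sketch that replaces each component of a dotted column path with a stable opaque token. It is not the utility discussed in the meeting: the function names, the "c_" prefix, the 12-character token length, and the salt parameter are all assumptions. It also assumes that column names do not themselves contain dots.

```python
import hashlib


def obfuscate_path(path, salt=b""):
    """Replace each component of a dotted column path with an opaque token."""
    parts = []
    for name in path.split("."):
        # Hashing per component keeps the nesting structure visible while
        # hiding the names; the salt makes guessing common names harder.
        digest = hashlib.sha256(salt + name.encode("utf-8")).hexdigest()
        parts.append("c_" + digest[:12])
    return ".".join(parts)


def obfuscate_schema(paths, salt=b""):
    """Map every original column path to its obfuscated form."""
    return {p: obfuscate_path(p, salt) for p in paths}


mapping = obfuscate_schema(["user.id", "user.email"], salt=b"local-secret")
```

The same input name always maps to the same token under a given salt, so the shape of a wide schema (depth, fan-out, repeated prefixes) survives, and that shape is what matters when comparing footer read times.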
