Attendees:

   - Julien: Datadog, metadata improvements, encodings.
   - Vinoo: timeseries. Listening in. Parquet compliance.
   - Ashish: log analytics, listening in. Favorite projects: Parquet and
     Arrow.
   - Claire: Spotify data infra, migrating from Avro to Parquet. Parquet-avro
     contributor. 1.14.2 release question.
   - Dewey: Voltron, Geometry type in C++. Collaborating on Java.
   - Neelaksh: G-Research MLH fellow benchmarking Parquet C++. Perf for ML
     workloads (10K columns). Appending FlatBuffers.
   - Rok: fintech, efficient wide-schema metadata, contributed to Arrow C++.
   - Xuwei: database startup. Contributing to the C++ Parquet module.
     Arrow-Parquet. Listening in.
   - Fokko: Databricks. Iceberg. Listening in.


Agenda:

Follow up items:

   - Alkis to pick where on GitHub to push his prototype branch
      - https://github.com/apache/parquet-format/pull/445
   - 1.14.2 release: bugfix related to Avro 1.8
   - PSA: file_offset in ColumnChunk is disabled in the C++ and Rust
     implementations
   - New metadata:
      - Appending a new footer
      - Neelaksh's benchmarking on metadata (de)serialization
        (https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1,
        https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking)
   - New releases: 1.14.2
      - Next 1.15 in September.


Meeting notes:

   - 1.14.2 release <https://github.com/apache/parquet-java/milestone/28>:
     bugfix related to Avro 1.8
      - 1.14 bug: parquet-avro uses an API that exists only in Avro 1.10 and
        above, which causes exceptions when running with Avro 1.8.
      - Fix by Claire: https://github.com/apache/parquet-java/pull/2957
      - The fix is required to use parquet-avro in 1.14.x with older versions
        of Avro.
      - Fokko: happy to help with the release.
   - PSA: FYI, the file_offset field in ColumnChunk is disabled in the C++
     and Rust implementations; if you rely on it, check the changes below (a
     quick way to inspect your own files follows this list):
      - https://github.com/apache/arrow/pull/43428
      - https://github.com/apache/arrow-rs/pull/6117
      - https://github.com/apache/parquet-format/pull/440
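
For the "check this" part, a minimal pyarrow sketch (the file name is a
placeholder, and pyarrow's ColumnChunkMetaData is assumed to expose
file_offset, dictionary_page_offset and data_page_offset): it prints the
file_offset recorded for each column chunk next to the page offsets, which
makes it easier to spot whether anything downstream still depends on the
field.

    # Minimal sketch: inspect the file_offset recorded for each ColumnChunk
    # via pyarrow. "example.parquet" is a placeholder path.
    import pyarrow.parquet as pq

    meta = pq.read_metadata("example.parquet")
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            print(
                f"row group {rg}, column {chunk.path_in_schema}: "
                f"file_offset={chunk.file_offset}, "
                f"dictionary_page_offset={chunk.dictionary_page_offset}, "
                f"data_page_offset={chunk.data_page_offset}"
            )
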
   - Neelaksh's benchmarking on metadata (de)serialization
     (https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1,
     https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking)
      - Working with G-Research.
      - Reproducible repository (Jupyter notebook).
      - Created a benchmark (an illustrative sketch follows this section):
         - Specifically perf of the (Thrift) metadata when increasing the
           number of columns.
         - Float32 for ML workloads.
         - Full and partial schema load.
         - Compression algorithm benchmark.
      - Proposal for an alternate file format:
         - Evaluate performance of FlatBuffers:
            - Convert the Thrift metadata to FlatBuffers.
            - Append it to the footer (see the footer-layout sketch after
              this section).
            - Parquet reader to parse it.
      - Next step:
         - More FlatBuffers benchmarking.
      - Xuwei:
         - About metadata benchmarks (I think their work is interesting):
           https://github.com/apache/arrow-rs/issues/5770
         - https://www.influxdata.com/blog/how-good-parquet-wide-tables/
         - Do you think a C++ FlatBuffers metadata API would help? I can
           draft one which could extend the footer to an outside FlatBuffer.
         - Will try to discuss it with Alkis.
         - It seems Alkis checked Scrub into the C++ lib; looking forward to
           the future work.
      - Rok: Alkis said in the last meeting that he has a branch internally
        that he'll bring as a PR. The work would add a FlatBuffers footer
        next to the Thrift one.
      - Issue in Arrow C++: https://github.com/apache/arrow/issues/43695
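
A rough illustration of the wide-schema benchmark described above (not the
actual harness from the fellowship repository): it writes a file with many
Float32 columns using pyarrow, then times a full metadata load and a
column-pruned read. Column count, row count and the file name are arbitrary
choices.

    # Rough sketch of a wide-schema metadata benchmark (illustrative only;
    # see the linked repository for the real harness).
    import time

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    NUM_COLUMNS = 10_000  # "wide schema", e.g. one column per ML feature
    NUM_ROWS = 10

    table = pa.table(
        {f"f{i}": np.random.rand(NUM_ROWS).astype(np.float32)
         for i in range(NUM_COLUMNS)}
    )
    pq.write_table(table, "wide.parquet")

    # Full metadata load: the whole Thrift footer is deserialized.
    start = time.perf_counter()
    meta = pq.read_metadata("wide.parquet")
    print(f"read_metadata: {time.perf_counter() - start:.4f}s "
          f"({meta.num_columns} columns)")

    # Partial schema load: even when only a few columns are requested, the
    # full footer has to be decoded before column pruning can happen.
    start = time.perf_counter()
    pq.read_table("wide.parquet", columns=["f0", "f1", "f2"])
    print(f"read_table (3 columns): {time.perf_counter() - start:.4f}s")
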
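For the "append it to the footer" idea, a small sketch of where the current
footer lives. The trailer layout (serialized Thrift FileMetaData, then a
4-byte little-endian footer length, then the PAR1 magic) is part of the
Parquet format; the FlatBuffers footer that would sit alongside it is only
the proposal under discussion (parquet-format PR 445 above), so this code
only locates the existing Thrift bytes.

    # Sketch: locate the serialized Thrift footer at the end of a Parquet
    # file (reusing "wide.parquet" from the sketch above, or any file).
    # Trailer layout: [Thrift FileMetaData][4-byte LE footer length]["PAR1"].
    import struct

    with open("wide.parquet", "rb") as f:
        f.seek(-8, 2)                       # last 8 bytes: length + magic
        footer_len = struct.unpack("<I", f.read(4))[0]
        assert f.read(4) == b"PAR1"         # trailing magic number
        f.seek(-(8 + footer_len), 2)
        thrift_footer = f.read(footer_len)  # raw Thrift FileMetaData bytes
    print(f"Thrift footer: {footer_len} bytes")

The proposal would let a reader find a FlatBuffers-encoded footer in a
similar way, next to the Thrift one; the exact layout is what the PR and
Alkis's branch are meant to define.
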
   - Releases:
      - Good to do some cleanup of old APIs for 2.0:
         - Ex: nanosecond timestamps: remove the old way of annotating types
           (2 ways to define logical types
           <https://github.com/apache/parquet-java/pull/1194>)
      - Communicate on releases:
         - 1.15: what do we want to include in this release? Fokko to start a
           thread on the mailing list.
         - 2.0

   - Improvements for ML:
      - Wide-schema metadata.
      - Merging sorted files.
      - Adding a FIXED_SIZE_LIST type:
         - https://github.com/apache/parquet-format/pull/241
         - Savings would come from faster (de)serialization.
      - Xuwei: problem of space amplification from writing repetition and
        definition levels (RL/DL) with fixed-size lists.
         - A FIXED_SIZE_LIST logical type doesn't solve it.
      - Proposal: store each list as a single binary value per row (a small
        sketch of the two layouts follows this list).
         - Pro: no more RL/DL amplification problem.
         - Con: lose the benefit of BYTE_STREAM_SPLIT encoding for float
           values.
         - Alternative:
            - An official fixed_list_type that doesn't …
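
To make the trade-off concrete, a small pyarrow sketch (vector width, row
count and file names are arbitrary): the same float32 vectors are written
once as a fixed-size list column, which Parquet stores as a LIST group with
repetition/definition levels per element, and once as a single fixed-size
binary value per row, which drops the per-element levels but turns the
floats into opaque bytes, so float encodings such as BYTE_STREAM_SPLIT no
longer apply.

    # Sketch of the two layouts discussed above (illustrative sizes/names).
    import os

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    DIM = 1024        # vector width, e.g. an embedding dimension
    NUM_ROWS = 1_000

    vectors = np.random.rand(NUM_ROWS, DIM).astype(np.float32)

    # Layout 1: fixed-size list of float32. Parquet has no fixed-size list,
    # so this is written as a LIST group with repetition/definition levels
    # for every element; float encodings (e.g. BYTE_STREAM_SPLIT) remain
    # available on the leaf values.
    list_col = pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), DIM)
    pq.write_table(pa.table({"vec": list_col}), "vec_list.parquet")

    # Layout 2: one fixed-size binary value per row. No per-element levels,
    # but the float bytes are opaque to the encoder.
    bin_col = pa.array([row.tobytes() for row in vectors],
                       type=pa.binary(DIM * 4))
    pq.write_table(pa.table({"vec": bin_col}), "vec_binary.parquet")

    print(os.path.getsize("vec_list.parquet"),
          os.path.getsize("vec_binary.parquet"))
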
   - Action items:
      - Julien:
         - Highlight the discussion on the metadata footer:
            - Xuwei
            - Neelaksh
            - Alkis
      - Fokko:
         - Follow up on planning the 1.15 scope on the mailing list (Micah
           also interested).
         - Start the 1.14.2 release.
      - Rok:
         - Follow up on the FIXED_SIZE_LIST type discussion
         - https://github.com/apache/parquet-format/pull/241
