Attendees:
- Julien: Datadog, metadata improvements, encodings.
- Vinoo: timeseries. Listening in. Parquet compliance.
- Ashish: log analytics, listening in. Favorite projects: Parquet and Arrow.
- Claire: Spotify data infra, migrating from Avro to Parquet. parquet-avro contributor. 1.14.2 release question.
- Dewey: Voltron, Geometry type in C++. Collaborating on Java.
- Neelaksh: G-Research MLH fellow benchmarking Parquet C++. Perf for ML workloads (10K columns). Appending flatbuffers.
- Rok: fintech, efficient wide schema metadata, contributed to Arrow C++.
- Xuwei: database startup. Contributing to the C++ Parquet module. Arrow-parquet. Listening in.
- Fokko: Databricks. Iceberg. Listening in.

Agenda:
- Follow up items: Alkis to pick where on GitHub to push his prototype branch
  - https://github.com/apache/parquet-format/pull/445
- 1.14.2 release: bugfix related to Avro 1.8
- PSA: file-offset in column Chunk is disabled in the C++ and Rust implementations
- New metadata:
  - Appending new footer
  - Neelaksh's benchmarking on metadata (de)serialization (https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1, https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking)
- New releases:
  - 1.14.2
  - Next: 1.15 in September.

Meeting notes:
- 1.14.2 release <https://github.com/apache/parquet-java/milestone/28>: bugfix related to Avro 1.8
  - 1.14 bug: parquet-avro uses an API that exists only in 1.10 and above, which causes exceptions when used with Avro 1.8.
  - Fix by Claire: https://github.com/apache/parquet-java/pull/2957
  - Required to use the Avro API in 1.14.x with older versions of Avro.
  - Fokko: happy to help with the release
- PSA: file-offset in column Chunk is disabled in the C++ and Rust implementations. If you rely on it, check these changes:
  - https://github.com/apache/arrow/pull/43428
  - https://github.com/apache/arrow-rs/pull/6117
  - https://github.com/apache/parquet-format/pull/440
- Neelaksh's benchmarking on metadata (de)serialization: https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1, https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
  - Working with G-Research.
  - Reproducible repository (Jupyter notebook)
  - Created a benchmark
    - Specifically perf of metadata (Thrift) when increasing the number of columns
    - Float32 for ML workloads
    - Full and partial schema load.
    - Compression algorithm benchmark.
  - Proposal for alternate file format.
    - Evaluate performance of flatbuffers
    - Converting Thrift to flatbuffers.
    - Append it to the footer.
    - Parquet reader to parse it.
  - Next step:
    - More flatbuffers benchmarking
- Xuwei
  - About metadata benchmark (I think their work is interesting):
    - https://github.com/apache/arrow-rs/issues/5770
    - https://www.influxdata.com/blog/how-good-parquet-wide-tables/
  - Do you think a C++ FlatBuffer Metadata API would help? I can draft one which could extend the footer to an outside FlatBuffer.
  - Will try to discuss it with Alkis.
  - Seems Alkis checked in Scrub in the C++ lib; looking forward to the future work.
- Rok: Alkis said in the last meeting that he has a branch internally that he'll bring as a PR. The work would add a flatbuffers footer next to the Thrift one.
  - Issue in Arrow C++: https://github.com/apache/arrow/issues/43695
- Releases:
  - Good to do some cleanup on old APIs for 2.0:
    - Ex: Nanosecond timestamp: remove old way of annotating types (2 ways to define logical types <https://github.com/apache/parquet-java/pull/1194>)
  - Communicate on releases
  - 1.15: what do we want to include in this release? Fokko to start a thread on the mailing list.
  - 2.0
- Improvements for ML:
  - Wide schema for metadata
  - Merging sorted files
  - Adding FIXED_SIZE_LIST type
    - https://github.com/apache/parquet-format/pull/241
    - Saving would come from faster (de)serialization
    - Xuwei: space amplification problem when reading and writing repetition levels (RL) and definition levels (DL) with fixed-size lists.
      - Fixed_size as a logical type doesn’t solve it.
    - Proposal: store as a binary value per row.
      - Pro: no more RL/DL amplification problem
      - Con: lose the benefit of stream split encoding of float numbers.
    - Alternative:
      - Official fixed_list_type that doesn’t .

Action items:
- Julien:
  - highlight discussion on metadata footer:
    - Xuwei
    - Neelaksh
    - Alkis
- Fokko:
  - follow up on planning 1.15 scope on the mailing list. (Micah also interested)
  - Start 1.14.2 release.
- Rok:
  - follow up on FIXED_SIZE_LIST type discussion
    - https://github.com/apache/parquet-format/pull/241
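
Illustration for the FIXED_SIZE_LIST discussion: a minimal Python sketch of the RL/DL amplification and the "binary value per row" proposal. It assumes every leaf value of a LIST column carries one repetition level and one definition level, and that the binary alternative is an optional column with one definition level per row. The function names are hypothetical and nothing here is a Parquet API. Levels are RLE/bit-packed on disk, so the counts reflect entries to encode and decode, not bytes.

```python
import struct


def level_entries_as_list(num_rows: int, list_size: int) -> int:
    """Level entries when each row is a LIST of `list_size` elements.

    Assumption: one repetition level and one definition level per leaf value.
    """
    return 2 * num_rows * list_size


def level_entries_as_binary(num_rows: int) -> int:
    """Level entries when each row is a single optional binary value.

    Assumption: one definition level per row and no repetition levels.
    """
    return num_rows


def pack_row(values: list[float]) -> bytes:
    """Pack one row of float32 values into a little-endian binary blob."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_row(blob: bytes) -> list[float]:
    """Inverse of pack_row: recover the float32 values from the blob."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


if __name__ == "__main__":
    rows, size = 1000, 128
    print("LIST level entries:  ", level_entries_as_list(rows, size))
    print("binary level entries:", level_entries_as_binary(rows))
    blob = pack_row([1.0, 2.5, -3.0])
    print("row bytes:", len(blob), "values:", unpack_row(blob))
```

The packed blob is opaque to the column encoder, which is the "Con" noted above: the floats inside it no longer benefit from stream split encoding as individual float values.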
