Thank you for the email updates. We read them during the meeting, which was quite useful.

Notes:
Attendees:
- Gene - Databricks
- Micah - Google
- Gabor - Dremio
- Fokko - Databricks
- Aihua - Snowflake
- Raul - QuantStack
- Neil - Snowflake
- Kenny - HyperParam: author of HyParquet
- Julien - Datadog
- Antoine - QuantStack
- Russell - Snowflake
- Rok

Agenda:
- Variant update: Gene/Daniel/Andrew/Gabor/Fokko/Rok/Aihua
  - (email updates from people who could not attend)
    - Daniel: With respect to the reference implementations for Variant, we
      had discussed the possibility of Rust or C++, but those both have
      significant work. The Java and native Python implementations are much
      closer and should cover the concerns for verification of the spec. I
      still think there will be work on the Rust side, but I don't think
      there's a C++ implementation that would be in a state to open source.
      For the shredding spec, Micah, Ryan, Russell and I met and are closing
      in on wording that everyone is happy with, so I expect that will close
      out shortly.
    - Andrew: As a brief update, I am working on finding someone to help with
      the Rust implementation of variant. Moving forward with Java and Python
      seems reasonable to me, though I would truly love to get a Rust
      implementation to ensure there are no potential gotchas for a native
      implementation.
  - Gene update:
    - Variant binary encoding/decoding:
      - Java implementation under review in parquet-java repo.
      - Python implementation in pyspark.
        - Pure-python in Spark repo.
    - Variant shredding:
      - Still working on the implementation in Spark.
    - What are the next steps?
      - TODO: follow up on the mailing list.
    - How are we releasing Variant?
      - Current plan: release Variant + shredding together.
      - Releasing variant binary first would be a possibility.
    - Remaining questions:
      - New types added to the format
        - Nano timestamp: need clarification on actual semantics. A long
          cannot store a nanosecond timestamp with a practical range. Year
          9999 is often used as a special value (which does not fit in a
          long).
          - Avro, Arrow, and Iceberg have a 64-bit nano ts.
            - Limited precision:
              [1677-09-21 00:12:43.145224193 - 2262-04-11 23:47:16.854775807]
          - Snowflake implementation: defaults to 8 bytes, expandable to 16
            bytes.
          - Variant could support a 16-byte version.
          - Step 1:
            - Support in variant:
              - 64-bit micros ts
              - 64-bit micros ts without timezone
              - 64-bit nanos ts
              - 64-bit nanos ts without timezone
            - Next step:
              - Gene to follow up with Russell, Ryan, Antoine
          - Step 2:
            - Add picosecond ts to Parquet. Define how it's mapped to native
              types.
        - Constraints:
          - Number of type codes
            - Using 20+ already (including: nano ts, time, UUID)
            - 6 bits: 64 types maximum (we might use the last to extend).
        - Interval: full range of SQL types.
  - Russell:
    - Someone at Snowflake will look into the parquet-cpp implementation of
      Variant.
      - Binary variant and the shredding
  - Antoine:
    - Do we have official Variant test cases to test various implementations?
    - It would be nice to provide a set of test cases for cross-language
      compatibility.
  - Variant implementations:
    - Java implementation (https://github.com/apache/parquet-java/pull/3117)
    - Python PR? https://github.com/apache/spark/pull/49591
      - There seem to be roundtrip tests against Spark here:
        https://github.com/apache/spark/blob/54a59b7f3ceb575e478650ab8ead01922595ea17/python/pyspark/sql/tests/test_types.py#L2060
- Wide schema performance problem: [Antoine] new footer
  - Interest in this work
    - Russell also interested.
  - Need to talk about encryption in the new footer.
    - Opportunity to improve encryption handling in the footer.
  - TODO: follow up with Alkis

On Wed, Jan 22, 2025 at 9:05 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I also unfortunately will not be able to make it today.
>
> As a brief update, I am working on finding someone to help with the Rust
> implementation of variant.
> Moving forward with Java and Python seems
> reasonable to me, though I would truly love to get a Rust implementation to
> ensure there is no potential gotcha's for a native implementation
>
> Thanks,
> Andrew
>
> On Wed, Jan 22, 2025 at 11:41 AM Daniel Weeks <dwe...@apache.org> wrote:
>
> > Hey Julien,
> >
> > I'm not going to be able to attend today's meeting, but just wanted to
> > follow up on a few of the items from the last meeting.
> >
> > With respect to the reference implementations for Variant, we had
> > discussed the possibility of Rust or C++, but those both have significant
> > work. The Java and native Python implementations are much closer and
> > should cover the concerns for verification of the spec. I still think
> > there will be work on the Rust side, but I don't think there's a C++
> > implementation that would be in a state to open source.
> >
> > For the shredding spec, Micah, Ryan, Russel and I met and are closing in on
> > wording that everyone is happy with, so I expect that will close out
> > shortly.
> >
> > -Dan
> >
> > On Wed, Jan 22, 2025 at 7:41 AM Julien Le Dem <jul...@apache.org> wrote:
> >
> > > The next Parquet sync is today Jan 22nd at 9:30am PT - 12:30pm ET - 6:30pm CET
> > > (in about 2hs)
> > > To join the invite:
> > > https://calendar.app.google/xXGgYU6evBArpzdZ9
> > > Please contact me to be added to the recurring invite.
> > > Everybody is welcome, bring your topic or just listen in.
> > > Best
> > > Julien
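
For reference, the 64-bit nanosecond timestamp limits listed in the notes can be reproduced with a few lines of Python. This is only a minimal sketch, assuming nanoseconds are counted from the Unix epoch in UTC and stored in a signed 64-bit integer; the helper name `format_epoch_nanos` is hypothetical and not part of any Parquet or Variant API.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
# Naive datetime, interpreted as UTC for this illustration (assumption).
EPOCH = datetime(1970, 1, 1)


def format_epoch_nanos(nanos: int) -> str:
    """Render nanoseconds since the Unix epoch with 9 fractional digits."""
    # Floor division keeps the fractional part in [0, 1e9), also for negatives.
    seconds, frac = divmod(nanos, 1_000_000_000)
    moment = EPOCH + timedelta(seconds=seconds)
    return f"{moment.isoformat(sep=' ', timespec='seconds')}.{frac:09d}"


# Nanoseconds needed to reach the end of year 9999, computed exactly
# from whole microseconds.
year_9999 = datetime(9999, 12, 31)
nanos_for_9999 = ((year_9999 - EPOCH) // timedelta(microseconds=1)) * 1000

print(format_epoch_nanos(INT64_MIN))      # 1677-09-21 00:12:43.145224192
print(format_epoch_nanos(INT64_MIN + 1))  # 1677-09-21 00:12:43.145224193
print(format_epoch_nanos(INT64_MAX))      # 2262-04-11 23:47:16.854775807
print(nanos_for_9999 > INT64_MAX)         # True
```

The lower bound quoted in the notes matches `INT64_MIN + 1`, which would be consistent with reserving the smallest value as a sentinel, though the notes do not say so. The last line illustrates why year 9999 cannot be represented: it needs roughly 2.5e20 nanoseconds, while a signed 64-bit integer tops out near 9.2e18.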