👋 I happened to work with Paul building InfluxDB 3.0 so I have some perspective on this matter
One example of "Parquet being a standard like SQL is standard" we hit is nanosecond timestamps. While they were added to the spec in 2018[1] they still aren't supported by major systems (ahem Spark, Iceberg) which means we can't interoperate with those systems unless we rewrite data to use millisecond timestamps. We hit the same issue with `BYTE_STREAM_SPLIT` encoding[2] which wasn't supported by pyspark as I remember (even though it was quite effective for our timestamp data). This is entirely an ecosystem / people / momentum problem in my opinion (not a technical one) I have some thoughts [3] on how to help (compatibility matrix, define what "compatible" means, etc) but I haven't been able to get people excited about it 🤷) Hope that offers some context, Andrew [1] https://github.com/apache/parquet-format/pull/102 [2]: https://github.com/apache/parquet-format/pull/144 [3]: https://github.com/apache/parquet-format/issues/441 On Tue, Nov 19, 2024 at 1:50 PM Steve Loughran <ste...@cloudera.com.invalid> wrote: > I watched a really good video recently where Paul Dix described the > iterations of influx DB with really open reviews about the issues they've > encountered > > One comment "Parquet is a standard like SQL is a standard" called out > inconsistent handling of Parquet data between their service and others. > > https://youtu.be/AGS4GNGDK_4?si=27CvjJ5O69cI8xZZ&t=3867 > > Does anyone know what problems were encountered? Are there open issues? And > sample files? > > I'm curious about this stuff -and debugging some of this seems a really > good way to learn about the lower level details of the formats and how > libraries work with that. > > Nobody wants to be accused as a "standard like SQL" in a way that is not > praise. >