Thank you for bringing this up, Martin.

The challenge, as I think we can see from this thread, is finding a balance between the cost of adding something new to the spec and the number of users (and thus contributors to help make it happen). If a feature isn't widely adopted in the ecosystem, adding it to the spec just increases complexity for no real benefit.
While I have no comments on Pco specifically one way or the other, I believe there *are* several research papers / proposals working their way through the publishing process that describe 'extensible' encodings (for example, ones that allow an optional machine-readable implementation of a decoder to be embedded in the file, via WASM). In my opinion this is the most promising way to extend the Parquet format with new, potentially very domain-specific, encodings.

I hinted at this at one of the syncs in October [1] and I hope we'll have more to share later this year.

Andrew

[1]: https://lists.apache.org/thread/jvdnx7ppkob0yfoty5chqc9bh4f9nxsj

On Tue, Mar 18, 2025 at 12:16 PM Antoine Pitrou <anto...@python.org> wrote:
>
> Hello,
>
> While I'm rather lukewarm towards an additional, novel encoding with
> a high implementation complexity, I think your arguments are unfair,
> Alkis.
>
> On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> >
> > From our internal numbers (Databricks) very little data in Parquet is
> > numbers. In terms of bytes flowing through the readers (uncompressed) we
> > see the following distribution:
>
> Please remember that Databricks is only one user of Parquet. Just
> because most of your data is BINARY doesn't mean this applies to
> Parquet data around the world.
>
> (for example, Parquet files for machine learning would obviously
> contain many numeric columns)
>
> > In addition to the above distribution we also know the average
> > compression ratio for integers with general compressors, which is
> > about 1.5x.
>
> Not only do we not know what the actual *average* would be on the
> entire corpus of Parquet files around the world, but an average over
> an unknown statistical distribution has very little information value.
>
> For example, if the average were 1.5x but with an upper decile at 20x,
> then that upper decile would be worth optimizing for (a decile of
> Parquet files is certainly a huge amount of data).
>
> > My stance is that adding future encodings should be gated with a large
> > enough experiment on *real* data showing both efficacy and wide
> > applicability.
>
> Agreed, but which "real" data? :-)
>
> Regards
>
> Antoine.
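
P.S. To make the "decoder embedded in the file" idea above a little more concrete, here is a minimal Rust sketch of the general shape such an extensible-encoding hook could take. To be clear, this is purely illustrative: the struct, trait, and names below are invented for this example and are not taken from the Parquet specification or from any of the proposals mentioned above, and the actual WASM instantiation is elided since it would depend on the sandbox/runtime chosen.

// Hypothetical sketch only: neither the struct nor the trait below is
// part of the Parquet specification or of any concrete proposal; the
// names are invented to illustrate the general shape of an
// "extensible encoding with an embedded WASM decoder".

/// Invented per-column-chunk descriptor for an extensible encoding.
pub struct ExtensibleEncoding {
    /// Namespaced identifier, e.g. "example.pco.v1" (made up).
    pub name: String,
    /// Optional WASM module exporting a well-known decode entry point.
    /// Readers with a native implementation of `name` can ignore it.
    pub wasm_decoder: Option<Vec<u8>>,
    /// Opaque, encoding-specific parameters.
    pub params: Vec<u8>,
}

/// What the reader ultimately needs: a way to turn encoded page bytes
/// back into the plain representation.
pub trait PageDecoder {
    fn decode(&self, encoded: &[u8]) -> Result<Vec<u8>, String>;
}

/// Trivial stand-in for a decoder the reader implements natively.
struct IdentityDecoder;

impl PageDecoder for IdentityDecoder {
    fn decode(&self, encoded: &[u8]) -> Result<Vec<u8>, String> {
        Ok(encoded.to_vec())
    }
}

/// Resolution order a reader might use: prefer a built-in decoder,
/// otherwise fall back to instantiating the embedded WASM module in a
/// sandbox (instantiation elided here).
fn resolve_decoder(
    enc: &ExtensibleEncoding,
    native: Option<Box<dyn PageDecoder>>,
) -> Result<Box<dyn PageDecoder>, String> {
    if let Some(decoder) = native {
        return Ok(decoder); // fast path: reader already knows this encoding
    }
    match &enc.wasm_decoder {
        Some(_module_bytes) => Err(format!(
            "would instantiate the embedded WASM decoder for '{}' here",
            enc.name
        )),
        None => Err(format!(
            "unknown encoding '{}' and no embedded decoder to fall back to",
            enc.name
        )),
    }
}

fn main() {
    let enc = ExtensibleEncoding {
        name: "example.identity.v1".to_string(),
        wasm_decoder: None,
        params: Vec::new(),
    };
    // A reader that happens to know this encoding natively.
    let decoder = resolve_decoder(&enc, Some(Box::new(IdentityDecoder))).unwrap();
    assert_eq!(decoder.decode(b"page bytes").unwrap(), b"page bytes");
}

The appeal, if something along these lines materializes, is that the spec gains one generic mechanism instead of one entry per new encoding; whether the sandboxing and performance cost is worth it is exactly what the upcoming proposals will have to argue.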