Hello,
While I'm rather lukewarm towards an additional, novel encoding with a high
implementation complexity, I think your arguments are unfair, Alkis.

On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:

> From our internal numbers (Databricks) very little data in parquet is
> numbers. In terms of bytes flowing through the readers (uncompressed) we
> see the following distribution:

Please remember that Databricks is only one user of Parquet. Just because
most of your data is BINARY doesn't mean the same holds for Parquet data
around the world. (For example, Parquet files for machine learning would
obviously contain many numeric columns.)

> In addition to the above distribution we also know the average compression
> ratio for integers with general compressors which is about 1.5x.

Not only do we not know what the actual *average* would be over the entire
corpus of Parquet files around the world, but an average over an unknown
statistical distribution carries very little information. For example, if
the average were 1.5x but the upper decile sat at 20x, then that upper
decile would be worth optimizing for (a decile of Parquet files is certainly
a huge amount of data); see the toy calculation in the P.S. below.

> My stance is that adding future encodings should be gated with a large
> enough experiment on *real* data showing both efficacy and wide
> applicability.

Agreed, but which "real" data? :-)

Regards

Antoine.
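
P.S.: to make the decile point concrete, here is a tiny toy calculation
(Python; every number below is invented purely for illustration, not an
actual corpus statistic):

# Hypothetical corpus of 10 files: (uncompressed size, compression ratio).
# All values are made up for illustration only.
files = [(100, 1.4)] * 9 + [(100, 20.0)]

total_uncompressed = sum(size for size, _ in files)
total_compressed = sum(size / ratio for size, ratio in files)

# Byte-weighted aggregate ratio over the whole corpus.
print("aggregate ratio: %.2fx" % (total_uncompressed / total_compressed))  # ~1.54x
# Ratio achieved by the best-compressing decile (1 file out of 10 here).
print("upper decile:    %.1fx" % max(ratio for _, ratio in files))         # 20.0x

The corpus-wide figure still looks like "about 1.5x", yet one file in ten
compresses at 20x, and that is exactly the data worth targeting with a
better encoding.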