Just to clarify, the distribution numbers I shared earlier represent data after decompression, not data stored without compression; I now see that "uncompressed" may have been misinterpreted. I also want to point out that these numbers track data flowing through the readers, so they are skewed towards reads rather than towards what is at rest. Lastly, the data comes from tracking across all Databricks customers and covers a wide variety of datasets, including financial, ML, and more. I therefore believe this mix is fairly representative of a "world average", skewed towards read performance.
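In case anyone wants to eyeball a similar breakdown on their own files: Parquet column-chunk metadata already records per-type uncompressed sizes, so a rough at-rest approximation (not reader traffic, per the caveat above) takes only a few lines of pyarrow. This is just a sketch; the file path is a placeholder:

    import collections
    import pyarrow.parquet as pq

    # Tally decompressed bytes per physical type from column-chunk metadata.
    # Note: this measures data at rest in one file, not bytes read by queries.
    md = pq.ParquetFile("example.parquet").metadata  # placeholder path
    totals = collections.Counter()
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            totals[chunk.physical_type] += chunk.total_uncompressed_size

    grand_total = sum(totals.values())
    for physical_type, nbytes in totals.most_common():
        print(f"{physical_type:12s} {100.0 * nbytes / grand_total:5.1f}%")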
Regarding the average compression ratio of ~1.5x for integers: it is true that it does not tell us much on its own, but I disagree that we cannot draw conclusions without the full distribution. We know the compression Databricks observes across INT64+INT32 data is ~1.5x (it is similar for floats). We know the pco benchmarks are run against data that compresses 4-5x with general compressors. This means the pco benchmark data is not representative of the data we observe in the Databricks set, so pco's performance on that benchmark data is not a good predictor of what it will do on the Databricks data. And if we believe the Databricks data is closer to representative of the world's data, then it is not a good predictor of what pco will do for the world either.

In the end it boils down to which dataset you think is more representative of the world's data. Put qualitatively: does the world's numeric data look more like taxi + air quality + reddit, or more like the mix of data from the thousands of customers around the world who happen to use Databricks for their data and AI needs? (A quick way to check which side of that divide a particular dataset falls on is sketched at the end of this message.)

On Tue, Mar 18, 2025 at 5:16 PM Antoine Pitrou <anto...@python.org> wrote:

> Hello,
>
> While I'm rather lukewarm towards an additional, novel encoding with
> a high implementation complexity, I think your arguments are unfair,
> Alkis.
>
> On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> >
> > From our internal numbers (Databricks) very little data in parquet is
> > numbers. In terms of bytes flowing through the readers (uncompressed) we
> > see the following distribution:
>
> Please remember that Databricks is only one user of Parquet. Just
> because most of your data is BINARY doesn't mean this applies to
> Parquet data around the world.
>
> (for example, Parquet files for machine learning would obviously
> contain many numeric columns)
>
> > In addition to the above distribution we also know the average compression
> > ratio for integers with general compressors which is about 1.5x.
>
> Not only we don't know what the actual *average* would be on the entire
> corpus of Parquet files around the world, but an average over an
> unknown statistical distribution has very little information value.
>
> For example, if the average were to be 1.5x, but with an upper decile at
> 20x, then that upper decile would be worth optimizing for (a decile of
> Parquet files is certainly a huge amount of data).
>
> > My stance is that adding future encodings should be gated with a large
> > enough experiment on *real* data showing both efficacy and wide
> > applicability.
>
> Agreed, but which "real" data? :-)
>
> Regards
>
> Antoine.
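For reference, here is one rough way to run that check on your own data (my own sketch, not something from the thread): read an INT32/INT64 column, compress its raw fixed-width bytes with a general-purpose compressor, and see whether the ratio lands nearer the ~1.5x we observe or the 4-5x reported for the pco benchmark corpus. The file path and column name are placeholders, it uses the third-party python-zstandard package, and it ignores Parquet's own encodings, so treat it only as a ballpark indicator:

    import pyarrow.parquet as pq
    import zstandard as zstd  # third-party package: python-zstandard

    # Placeholder file and column names; pick any INT32/INT64 column you have.
    table = pq.read_table("example.parquet", columns=["some_int64_column"])
    values = table.column("some_int64_column").drop_null().to_numpy()

    raw = values.tobytes()  # raw fixed-width integers, no Parquet encoding applied
    compressed = zstd.ZstdCompressor(level=3).compress(raw)
    print(f"general-compressor ratio: {len(raw) / len(compressed):.2f}x")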