*Alkis:*

One of your arguments is that (only 7% of decompressed parquet data in Databricks is numerical => better numerical compression isn't worthwhile). Problems with that argument:

* Even 7% of Parquet data is a tremendous amount.
* Compressed, a much higher proportion of data is numerical. As you said yourself, numerical data in Databricks averages a compression ratio of only 1.5x, while strings are usually much more compressible. (See the sketch just after this list.)
* As you mentioned, that's only incoming data, and stored data could be different.
* Databricks isn't representative of all use cases.
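A back-of-envelope sketch (Python) of the second bullet. The 7% decompressed share and the ~1.5x integer ratio are the figures from this thread; the ~4x ratio assumed for the remaining (mostly string/binary) data is purely illustrative, not a measured number.

    # Share of *compressed* (stored) bytes that is numeric, given shares
    # measured on decompressed bytes and per-type compression ratios.
    numeric_share_decompressed = 0.07  # from the Databricks numbers in this thread
    numeric_ratio = 1.5                # reported ratio for INT32/INT64
    other_ratio = 4.0                  # ASSUMED ratio for the remaining (mostly BINARY) data

    numeric_compressed = numeric_share_decompressed / numeric_ratio
    other_compressed = (1 - numeric_share_decompressed) / other_ratio
    print(numeric_compressed / (numeric_compressed + other_compressed))  # ~0.17

Under these assumptions, roughly 17% of the bytes actually at rest are numeric, not 7%, because the string-heavy remainder shrinks more under general-purpose compression.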
Another is that (Pco gets high compression ratio on benchmark datasets, and Databricks averages low compression ratio on its data ==(a)==> Pco's benchmarks aren't representative of Databricks data ==(b)==> Pco isn't good on Databricks data). Problems with this argument:

a. A very likely explanation is that Pco simply compresses data better than what Databricks customers are doing. This is a good thing in Pco's favor.
b.
* Even if somehow none of our 6 publicly available, columnar datasets are representative of Databricks' data compressibility, they can still be representative of Databricks' data patterns.
* Even if Databricks data exhibits different patterns not found in any of our benchmarks, Pco can still handle them better.
* Since your data is proprietary, the onus is on you to demonstrate any properties it has empirically.

*ZOOMING OUT*:

I think we can all agree that better compression for numerical data is a good thing, and that Parquet should adopt newer techniques if they improve substantially on the status quo. I welcome any benchmarks if people would like to evaluate this for their own data; the pcodec CLI makes this very easy.

Complexity seems to be the main sticking point. I would argue that a 20% higher compression ratio is a big enough win to warrant a slight increase in complexity (Pco is <1/6 the size of zstd in LoC). To my knowledge, storage (not to mention network load) is the most important cost factor in Parquet today. I'm happy to discuss this compression/complexity tradeoff more in this thread or at the Parquet meet tomorrow.

On Tue, Mar 18, 2025 at 2:08 PM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:

> Just to clarify, the distribution numbers I shared earlier represent data after decompression, not data stored without compression. I now understand that "uncompressed" may have been misinterpreted. In addition, I want to point out that these are data that go through the readers, so this is skewed towards reads and not towards what is at rest. Lastly, the data comes from tracking across all Databricks customers, encompassing a wide variety of datasets, including financial, ML, and more. Therefore, I believe this mix is fairly representative of a "world average" skewed towards read performance.
>
> Regarding the average compression ratio of ~1.5x for integers: it is true that it does not tell us a lot, but I disagree that we cannot draw conclusions without the full distribution. We know the compression Databricks observes across INT64+INT32 data is 1.5x (it is similar for float as well). We know the pco benchmarks are against data that compresses 4-5x with general compressors. This means the pco benchmark data is not representative of the data we observe in the Databricks set. This means the performance of pco on the selected benchmark data is not a good predictor of what it will do on the Databricks data. And if we believe Databricks data is closer to representative of the world's data, then it is not a good predictor of what pco will do for the world.
>
> In the end it boils down to which dataset you think is more representative of the world's data. Put qualitatively, does the world's (numeric) data look more like taxi + air quality + reddit, or more like the data of a mix of the thousands of customers around the world that happen to use Databricks for their data and AI needs?
>
> On Tue, Mar 18, 2025 at 5:16 PM Antoine Pitrou <anto...@python.org> wrote:
>
> > Hello,
> >
> > While I'm rather lukewarm towards an additional, novel encoding with a high implementation complexity, I think your arguments are unfair, Alkis.
> >
> > On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> >
> > > From our internal numbers (Databricks) very little data in parquet is numbers. In terms of bytes flowing through the readers (uncompressed) we see the following distribution:
> >
> > Please remember that Databricks is only one user of Parquet. Just because most of your data is BINARY doesn't mean this applies to Parquet data around the world.
> >
> > (for example, Parquet files for machine learning would obviously contain many numeric columns)
> >
> > > In addition to the above distribution we also know the average compression ratio for integers with general compressors, which is about 1.5x.
> >
> > Not only do we not know what the actual *average* would be on the entire corpus of Parquet files around the world, but an average over an unknown statistical distribution has very little information value.
> >
> > For example, if the average were to be 1.5x, but with an upper decile at 20x, then that upper decile would be worth optimizing for (a decile of Parquet files is certainly a huge amount of data).
> >
> > > My stance is that adding future encodings should be gated with a large enough experiment on *real* data showing both efficacy and wide applicability.
> >
> > Agreed, but which "real" data? :-)
> >
> > Regards
> >
> > Antoine.