Just to clarify, the distribution numbers I shared earlier represent data
after decompression, not data stored without compression; I now see that
"uncompressed" may have been misinterpreted. In addition, I want to point
out that this is data flowing through the readers, so the mix is skewed
towards reads rather than towards what sits at rest. Lastly, the data comes
from tracking across all Databricks customers and covers a wide variety of
datasets, including financial, ML, and more. I therefore believe this mix is
fairly representative of a "world average", skewed towards read performance.

Regarding the average compression ratio of ~1.5x for integers: it is true
that an average alone does not tell us a lot, but I disagree that we cannot
draw conclusions without the full distribution. We know the compression
Databricks observes across INT64+INT32 data is about 1.5x (and it is similar
for floats). We also know the pco benchmarks run against data that compresses
4-5x with general compressors. That means the pco benchmark data is not
representative of the data we observe in the Databricks set, and therefore
pco's performance on the selected benchmark data is not a good predictor of
what it will do on Databricks data. And if we believe the Databricks data is
closer to representative of world data, then it is not a good predictor of
what pco will do for the world either.
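To make the gap concrete, here is a rough illustration with synthetic
distributions (not the Databricks corpus and not the pco benchmark data; the
exact ratios are not meant to match either), showing how the same general
compressor behaves very differently depending on the shape of the integer
data:

# Illustrative only: compare a general compressor on two synthetic
# distributions of 64-bit integers.
from array import array
import random
import zlib

random.seed(0)
N = 200_000

# Smooth, low-entropy values, in the spirit of benchmark-friendly data.
smooth = [i // 100 for i in range(N)]
# High-entropy values, closer in character to data that barely compresses.
noisy = [random.getrandbits(64) for _ in range(N)]

def ratio(values):
    raw = array("Q", values).tobytes()
    return len(raw) / len(zlib.compress(raw, 6))

print(f"smooth ints: {ratio(smooth):.1f}x")  # compresses many times over
print(f"noisy ints:  {ratio(noisy):.1f}x")   # stays close to 1x

The point is not the specific numbers, but that a corpus averaging 1.5x and a
benchmark suite compressing 4-5x are sampling very different regions of this
spectrum.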

In the end it boils down to which dataset you think is more representative
of the world's data. Put qualitatively: does the world's (numeric) data look
more like taxi + air quality + Reddit, or more like the mix of data from the
thousands of customers around the world who happen to use Databricks for
their data and AI needs?


On Tue, Mar 18, 2025 at 5:16 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> While I'm rather lukewarm towards an additional, novel encoding with
> a high implementation complexity, I think your arguments are unfair,
> Alkis.
>
> On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> >
> > From our internal numbers (Databricks) very little data in parquet is
> > numbers. In terms of bytes flowing through the readers (uncompressed) we
> > see the following distribution:
>
> Please remember that Databricks is only one user of Parquet. Just
> because most of your data is BINARY doesn't mean this applies to
> Parquet data around the world.
>
> (for example, Parquet files for machine learning would obviously
> contain many numeric columns)
>
> > In addition to the above distribution we also know the average compression
> > ratio for integers with general compressors which is about 1.5x.
>
> Not only do we not know what the actual *average* would be on the entire
> corpus of Parquet files around the world, but an average over an
> unknown statistical distribution has very little information value.
>
> For example, if the average were to be 1.5x, but with an upper decile at
> 20x, then that upper decile would be worth optimizing for (a decile of
> Parquet files is certainly a huge amount of data).
>
> > My stance is that adding future encodings should be gated with a large
> > enough experiment on *real* data showing both efficacy and wide
> > applicability.
>
> Agreed, but which "real" data? :-)
>
> Regards
>
> Antoine.
>
>
>
