*Alkis:*

One of your arguments is that (only 7% of decompressed Parquet data in
Databricks is numerical => better numerical compression isn't worthwhile).
Problems with that argument:
* Even 7% of Parquet data is a tremendous amount.
* Compressed, a much higher proportion of data is numerical. As you said
yourself, numerical data in Databricks averages a compression ratio of only
1.5, while strings are usually much more compressible (see the quick
calculation after this list).
* As you mentioned, that's only data flowing through the readers (skewed
toward reads), and data at rest could look different.
* Databricks isn't representative of all use cases.
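
To make the "compressed proportion" bullet concrete, here's a
back-of-the-envelope Python calculation. The 7% share and the 1.5x
numerical ratio are the numbers you gave; the 4x ratio for the
non-numerical remainder is purely my assumption for illustration:

    # Share of *compressed* bytes that are numerical, given that 7% of
    # decompressed bytes are numerical and compress ~1.5x. The 4x ratio
    # for the remaining 93% (mostly strings) is assumed, not measured.
    numeric_share, numeric_ratio = 0.07, 1.5
    other_share, other_ratio = 0.93, 4.0  # assumption, for illustration only

    numeric_bytes = numeric_share / numeric_ratio      # ~0.047
    other_bytes = other_share / other_ratio            # ~0.233
    print(numeric_bytes / (numeric_bytes + other_bytes))  # ~0.17

Under those assumptions, numerical data is roughly 17% of bytes at rest,
not 7%.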

Another argument is that (Pco gets a high compression ratio on benchmark
datasets, and Databricks averages a low compression ratio on its data
==(a)==> Pco's benchmarks aren't representative of Databricks data ==(b)==>
Pco isn't good on Databricks data). Problems with this argument:
a. A very likely explanation is that Pco simply compresses data better than
what Databricks customers are currently doing. That's a point in Pco's
favor, not against it.
b.
  * Even if somehow none of our 6 publicly available, columnar datasets are
representative of Databricks' data compressibility, they can still be
representative of its data patterns.
  * Even if Databricks data exhibits different patterns not found in any of
our benchmarks, Pco can still handle them better.
  * Since your data is proprietary, the onus is on you to demonstrate
empirically whatever properties it has.

*ZOOMING OUT*:

I think we can all agree that better compression for numerical data is a
good thing, and that Parquet should adopt newer techniques if they improve
substantially on the status quo. I welcome any benchmarks if people would
like to evaluate this for their own data; the pcodec CLI makes this very
easy.
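
For example, here's a rough Python sketch (file and column names are
hypothetical) of the kind of sanity check anyone can run to see whether
their numeric columns compress more like the ~1.5x Databricks average or
the 4-5x benchmark datasets with a general-purpose compressor:

    import pyarrow.parquet as pq
    import zstandard as zstd

    # Hypothetical file and column names; substitute a numeric column of
    # your own (assumes no nulls for simplicity).
    col = "some_int64_column"
    table = pq.read_table("your_data.parquet", columns=[col])
    raw = table.column(col).to_numpy().tobytes()

    compressed = zstd.ZstdCompressor(level=3).compress(raw)
    print(f"zstd ratio: {len(raw) / len(compressed):.2f}x")

Running the same column through the pcodec CLI (see its README for the
exact commands) then gives the ratio that actually matters for this
discussion.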

Complexity seems to be the main sticking point. I would argue that a 20%
higher compression ratio is a huge enough win to warrant a slight increase
in complexity (Pco is <1/6 the size of zstd in LoC). To my knowledge,
storage (not to mention network load) is the most important cost factor in
Parquet today. I'm happy to discuss this compression/complexity tradeoff
further in this thread or at the Parquet meeting tomorrow.
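
To put that 20% figure in storage terms (treating it as an across-the-board
average, which is of course a simplification):

    # If a column compresses at ratio r with today's encodings and 1.2 * r
    # with Pco, the bytes stored for that column shrink by 1 - 1/1.2.
    baseline_ratio = 1.5              # e.g. the average you cited for integers
    pco_ratio = 1.2 * baseline_ratio
    print(f"storage saved: {1 - baseline_ratio / pco_ratio:.1%}")  # ~16.7%

That is roughly a sixth of the stored bytes for the affected columns.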

On Tue, Mar 18, 2025 at 2:08 PM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> Just to clarify, the distribution numbers I shared earlier represent data
> after decompression, not data stored without compression. I now understand
> that "uncompressed" may have been misinterpreted. In addition I want to
> point out that these are data that go through the readers, so this is
> skewed towards reads and not towards what is at rest. Lastly, the data
> comes from tracking across all Databricks customers, encompassing a wide
> variety of datasets, including financial, ML, and more. Therefore, I
> believe this mix is fairly representative of a "world average" skewed
> towards read performance.
>
> Regarding the average compression ratio of ~1.5 for integers: it is true
> that it does not tell us a lot, but I disagree that we cannot draw
> conclusions without the full distribution. We know the compression
> Databricks observes across INT64+INT32 data is 1.5x (it is similar for
> float as well). We know pco benchmarks are against data that compresses
> 4-5x with general compressors. This means the pco benchmark data is not
> representative of the data we observe in the Databricks set. This means the
> performance of pco on the selected benchmark data is not a good predictor
> of what it will do in the Databricks data. And if we believe Databricks
> data is closer to representative of the world data, then it is not a good
> predictor of what pco will do for the world.
>
> In the end, it boils down to which dataset you think is more representative
> of the world data. Put qualitatively, does the world (numeric) data
> look more like taxi+air quality+reddit or more like the data of a mix of
> the thousands of customers around the world that happen to use Databricks
> for their data and AI needs?
>
>
> On Tue, Mar 18, 2025 at 5:16 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hello,
> >
> > While I'm rather lukewarm towards an additional, novel encoding with
> > a high implementation complexity, I think your arguments are unfair,
> > Alkis.
> >
> > On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> > >
> > > From our internal numbers (Databricks) very little data in parquet is
> > > numbers. In terms of bytes flowing through the readers (uncompressed) we
> > > see the following distribution:
> >
> > Please remember that Databricks is only one user of Parquet. Just
> > because most of your data is BINARY doesn't mean this applies to
> > Parquet data around the world.
> >
> > (for example, Parquet files for machine learning would obviously
> > contain many numeric columns)
> >
> > > In addition to the above distribution we also know the average compression
> > > ratio for integers with general compressors which is about 1.5x.
> >
> > Not only do we not know what the actual *average* would be on the entire
> > corpus of Parquet files around the world, but an average over an
> > unknown statistical distribution has very little information value.
> >
> > For example, if the average were to be 1.5x, but with an upper decile at
> > 20x, then that upper decile would be worth optimizing for (a decile of
> > Parquet files is certainly a huge amount of data).
> >
> > > My stance is that adding future encodings should be gated with a large
> > > enough experiment on *real* data showing both efficacy and wide
> > > applicability.
> >
> > Agreed, but which "real" data? :-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
