One of the things we worked on last year was adding binary protocol
extensions to Parquet. This is a way to embed arbitrary metadata in any
Thrift-serialized blob in Parquet such that old readers remain backwards
compatible:
https://github.com/apache/parquet-format/blob/master/BinaryProtocolExtensions.md

Using this scheme, one can implement Pco and encode numeric columns
twice: once with Pco and once with standard encodings (if backwards
compatibility is needed). Readers that know about Pco will read the Pco
data, while old readers will read the compatibility data. If
compatibility is not needed, one can even drop the second encoding. Once
this is in place, run it experimentally in some production environment
and report back on the findings.
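
As a rough illustration of that fallback, here is a toy Python sketch.
The dict layout, the "pco_bytes"/"plain_bytes" names, and the use of
zlib as a stand-in codec are hypothetical and purely illustrative; the
real mechanism is the Thrift-level extension described in the document
linked above, and a real writer would invoke the actual Pco codec.

    import zlib

    def write_column(values, with_compat=True):
        # Raw little-endian int64 values: stand-in for a standard PLAIN encoding.
        raw = b"".join(v.to_bytes(8, "little", signed=True) for v in values)
        chunk = {
            # Stand-in for a Pco-encoded buffer; a real writer would call the
            # Pco codec and attach it via the binary protocol extension.
            "pco_bytes": zlib.compress(raw, 9),
        }
        if with_compat:
            # Compatibility copy in a standard encoding old readers understand.
            chunk["plain_bytes"] = raw
        return chunk

    def read_column(chunk, understands_pco):
        # New readers pick the Pco buffer; old readers never see the extension
        # and fall back to the standard encoding.
        if understands_pco and "pco_bytes" in chunk:
            raw = zlib.decompress(chunk["pco_bytes"])
        else:
            raw = chunk["plain_bytes"]
        return [int.from_bytes(raw[i:i + 8], "little", signed=True)
                for i in range(0, len(raw), 8)]

    chunk = write_column([1, 2, 3, 4])
    assert read_column(chunk, True) == read_column(chunk, False) == [1, 2, 3, 4]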

This is what we are doing with the new footer: we embed another version
of the footer inside the footer itself and run experiments on real-world
data before we propose adding it to the format.

I feel it would be more productive to discuss this in the sync, because
I see that you are drawing incorrect conclusions from the data I
presented:

> a. A very likely explanation is that Pco simply compresses data better
than what Databricks customers are doing. This is a good thing in Pco's
favor.

This doesn't follow. The data you chose to benchmark Pco with compresses
5x with *zstd*. The data I observe across all of Databricks compresses
1.5x with *zstd*. These numbers say nothing about how well Pco compresses
anything. They clearly show that the benchmark data you picked is not
representative of Databricks data.

> Even if somehow none of our 6 publicly available, columnar datasets are
representative of Databricks' data compressibility, they can still be
representative of Databricks' data patterns.

Maybe for a subset of the 7%. But that's speculation.

> Even if Databricks data exhibits different patterns not found in any of
our benchmarks, Pco can still handle them better.

This is also speculation. How do we know Pco handles all numeric data
better than zstd? As far as we know, Pco was run only on the 6 publicly
available datasets and does about 20% better than zstd on data that zstd
can already compress 5x. We do not know how well Pco does on
high-entropy data that compresses only 1.5x.
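
Anyone can check where their own data falls on that scale. Below is a
minimal Python sketch (assuming the numpy and zstandard packages; the
synthetic arrays are placeholders - substitute your own numeric columns,
e.g. loaded via pyarrow.parquet):

    import numpy as np
    import zstandard as zstd

    def zstd_ratio(values, level=3):
        # Uncompressed bytes divided by zstd-compressed bytes for one column.
        raw = np.ascontiguousarray(values).tobytes()
        return len(raw) / len(zstd.ZstdCompressor(level=level).compress(raw))

    # Placeholders for real columns: one low-entropy, one high-entropy.
    low_entropy = np.arange(1_000_000, dtype=np.int64)
    high_entropy = np.random.default_rng(0).integers(
        0, np.iinfo(np.int64).max, size=1_000_000, dtype=np.int64)

    print(f"low entropy:  {zstd_ratio(low_entropy):.1f}x")
    print(f"high entropy: {zstd_ratio(high_entropy):.1f}x")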

> Since your data is proprietary, the onus is on you to demonstrate any
properties it has empirically.

The bar for adding new encodings is, and should be, very high.
Implementing them is very costly (many languages, many engines) and
removing them is virtually impossible. Since you are proposing that
everyone incur this cost (and it is a very high cost - 1/6th the LoC of
zstd - none of the other encodings come anywhere close to this
complexity), it is only logical that the onus falls on you. It would be
quite unfortunate if the modus operandi for Parquet format evolution
were for engine owners to implement every proposed encoding to prove or
disprove its worth in practice.



On Tue, Mar 18, 2025 at 9:50 PM Martin Loncaric
<mlonca...@janestreet.com.invalid> wrote:

> *Alkis:*
>
> One of your arguments is that (only 7% of decompressed parquet data in
> Databricks is numerical => better numerical compression isn't worthwhile).
> Problems with that argument:
> * Even 7% of Parquet data is a tremendous amount.
> * Compressed, a much higher proportion of data is numerical. As you said
> yourself, numerical data in Databricks averages a compression ratio of
> only 1.5. Strings are usually much more compressible.
> * As you mentioned, that's only incoming data, and stored data could be
> different.
> * Databricks isn't representative of all use cases.
>
> Another is that (Pco gets high compression ratio on benchmark datasets, and
> Databricks averages low compression ratio on its data ==(a)==> Pco's
> benchmarks aren't representative of Databricks data ==(b)==> Pco isn't good
> on Databricks data). Problems with this argument:
> a. A very likely explanation is that Pco simply compresses data better than
> what Databricks customers are doing. This is a good thing in Pco's favor.
> b.
>   * Even if somehow none of our 6 publicly available, columnar datasets are
> representative of Databricks' data compressibility, they can still be
> representative of Databricks' data patterns.
>   * Even if Databricks data exhibits different patterns not found in any of
> our benchmarks, Pco can still handle them better.
>   * Since your data is proprietary, the onus is on you to demonstrate any
> properties it has empirically.
>
> *ZOOMING OUT*:
>
> I think we can all agree that better compression for numerical data is a
> good thing, and that Parquet should adopt newer techniques if they improve
> substantially on the status quo. I welcome any benchmarks if people would
> like to evaluate this for their own data; the pcodec CLI makes this very
> easy.
>
> Complexity seems to be the main sticking point. I would argue that a 20%
> higher compression ratio is a huge enough win to warrant a slight increase
> in complexity (Pco is <1/6 the size of zstd in LoC). To my knowledge,
> storage (not to mention network load) is the most important cost factor in
> Parquet today. I'm happy to discuss this compression/complexity tradeoff
> more in this thread or at the Parquet meet tomorrow.
>
> On Tue, Mar 18, 2025 at 2:08 PM Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.invalid> wrote:
>
> > Just to clarify, the distribution numbers I shared earlier represent data
> > after decompression, not data stored without compression. I now
> understand
> > that "uncompressed" may have been misinterpreted. In addition I want to
> > point out that these are data that go through the readers, so this is
> > skewed towards reads and not towards what is at rest. Lastly, the data
> > comes from tracking across all Databricks customers, encompassing a wide
> > variety of datasets, including financial, ML, and more. Therefore, I
> > believe this mix is fairly representative of a "world average" skewed
> > towards read performance.
> >
> > Regarding the average compression ratio of ~1.5 for integers: it is true
> > that it does not tell us a lot, but I disagree that we cannot draw
> > conclusions without the full distribution. We know the compression
> > Databricks observes across INT64+INT32 data is 1.5x (it is similar for
> > float as well). We know pco benchmarks are against data that compresses
> > 4-5x with general compressors. This means the pco benchmark data is not
> > representative of the data we observe in the Databricks set. This means
> the
> > performance of pco on the selected benchmark data is not a good predictor
> > of what it will do in the Databricks data. And if we believe Databricks
> > data is closer to representative of the world data, then it is not a good
> > predictor of what pco will do for the world.
> >
> > At the end it boils down to which dataset you think is more
> representative
> > of the world data. Put qualitatively, does the world (numeric) data
> > look more like taxi+air quality+reddit or more like the data of a mix of
> > the thousands of customers around the world that happen to use Databricks
> > for their data and AI needs?
> >
> >
> > On Tue, Mar 18, 2025 at 5:16 PM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> > >
> > > Hello,
> > >
> > > While I'm rather lukewarm towards an additional, novel encoding with
> > > a high implementation complexity, I think your arguments are unfair,
> > > Alkis.
> > >
> > > On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> > > >
> > > > From our internal numbers (Databricks) very little data in parquet is
> > > > numbers. In terms of bytes flowing through the readers (uncompressed)
> > we
> > > > see the following distribution:
> > >
> > > Please remember that Databricks is only one user of Parquet. Just
> > > because most of your data is BINARY doesn't mean this applies to
> > > Parquet data around the world.
> > >
> > > (for example, Parquet files for machine learning would obviously
> > > contain many numeric columns)
> > >
> > > > In addition to the above distribution we also know the average
> > > compression
> > > > ratio for integers with general compressors which is about 1.5x.
> > >
> > > Not only do we not know what the actual *average* would be on the entire
> > > corpus of Parquet files around the world, but an average over an
> > > unknown statistical distribution has very little information value.
> > >
> > > For example, if the average were to be 1.5x, but with an upper decile
> at
> > > 20x, then that upper decile would be worth optimizing for (a decile of
> > > Parquet files is certainly a huge amount of data).
> > >
> > > > My stance is that adding future encodings should be gated with a
> large
> > > > enough experiment on *real* data showing both efficacy and wide
> > > > applicability.
> > >
> > > Agreed, but which "real" data? :-)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
