Hello,

While I'm rather lukewarm towards adding another novel encoding with
high implementation complexity, I think your arguments are unfair,
Alkis.

On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> 
> From our internal numbers (Databricks) very little data in parquet is
> numbers. In terms of bytes flowing through the readers (uncompressed) we
> see the following distribution:

Please remember that Databricks is only one user of Parquet. Just
because most of your data is BINARY doesn't mean this applies to
Parquet data around the world.

(for example, Parquet files for machine learning would obviously
contain many numeric columns)

> In addition to the above distribution we also know the average compression
> ratio for integers with general compressors which is about 1.5x.

Not only do we not know what the actual *average* would be over the
entire corpus of Parquet files around the world, but an average over an
unknown statistical distribution has very little information value.

For example, if the average were 1.5x but the upper decile compressed
at 20x, then that upper decile would be well worth optimizing for (a
decile of Parquet files is certainly a huge amount of data).
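
As a rough illustration (the numbers below are made up, not measured
anywhere), a corpus-level ratio of about 1.5x can coexist with a full
decile that compresses 20x:

    # Hypothetical numbers only: ten equally sized "deciles" of integer
    # column bytes, nine compressing modestly and one compressing very well.
    ratios = [1.35] * 9 + [20.0]      # per-decile compression ratios
    uncompressed = [1.0] * 10         # each decile holds an equal share of bytes

    compressed = [u / r for u, r in zip(uncompressed, ratios)]
    aggregate = sum(uncompressed) / sum(compressed)

    print(f"aggregate ratio: {aggregate:.2f}x")    # ~1.49x, i.e. "about 1.5x"
    print(f"top-decile ratio: {ratios[-1]:.0f}x")  # yet a tenth of the bytes shrinks 20x

The aggregate figure alone doesn't tell you whether such a decile
exists.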

> My stance is that adding future encodings should be gated with a large
> enough experiment on *real* data showing both efficacy and wide
> applicability.

Agreed, but which "real" data? :-)

Regards

Antoine.

