Thank you for bringing this up, Martin,

The challenge, as I think we can see from this thread, is finding a
balance between the cost of adding something new to the spec and the number
of users (and thus contributors to help make it happen). If a feature isn't
widely adopted in the ecosystem, adding it to the spec just increases
complexity needlessly.

While I have no comments on Pco specifically one way or the other, I
believe there *are* several research papers / proposals working their way
through the publishing process that describe 'extensible' encodings (for
example, ones that allow an optional machine-readable implementation of a
decoder to be shipped in the file, via WASM). In my opinion this is the
most promising way to extend the Parquet format with new, potentially very
domain-specific, encodings.
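
To make that concrete, here is a rough sketch (my own illustration, not
taken from any of those proposals) of what the reader side could look like
if a file shipped its decoder as WASM bytes and exported a hypothetical
`decode` function. The function name, the memory layout, and the use of the
wasmtime crate are all assumptions for illustration only:

    // Sketch only: assumes the Parquet file carries the decoder as WASM
    // bytes and that the module exports "memory" and a hypothetical
    // `decode(input_ptr, input_len) -> output_len` function.
    use wasmtime::{Engine, Instance, Module, Store};

    fn decode_with_embedded_decoder(
        wasm_bytes: &[u8],   // decoder bytes read from file metadata (assumed)
        encoded_page: &[u8], // encoded page bytes from the column chunk
    ) -> anyhow::Result<Vec<u8>> {
        let engine = Engine::default();
        let module = Module::new(&engine, wasm_bytes)?;
        let mut store = Store::new(&engine, ());
        let instance = Instance::new(&mut store, &module, &[])?;

        // Hypothetical ABI: copy the encoded bytes into the module's linear
        // memory at offset 0, then call `decode`, which writes its output
        // immediately after the input and returns the output length.
        let memory = instance
            .get_memory(&mut store, "memory")
            .ok_or_else(|| anyhow::anyhow!("decoder exports no memory"))?;
        memory.write(&mut store, 0, encoded_page)?;

        let decode = instance
            .get_typed_func::<(i32, i32), i32>(&mut store, "decode")?;
        let out_len = decode.call(&mut store, (0, encoded_page.len() as i32))? as usize;

        let mut out = vec![0u8; out_len];
        memory.read(&store, encoded_page.len(), &mut out)?;
        Ok(out)
    }

A real proposal would of course need to pin down the ABI, allocation, and
sandboxing story; the above is only meant to show the shape of the idea.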

I hinted at this at one of the syncs in October[1] and I hope we'll have
more to share later this year.

Andrew

[1]: https://lists.apache.org/thread/jvdnx7ppkob0yfoty5chqc9bh4f9nxsj

On Tue, Mar 18, 2025 at 12:16 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> While I'm rather lukewarm towards an additional, novel encoding with
> a high implementation complexity, I think your arguments are unfair,
> Alkis.
>
> On Tue, 18 Mar 2025 15:56:08 +0100 Alkis Evlogimenos wrote:
> >
> > From our internal numbers (Databricks), very little data in Parquet is
> > numbers. In terms of bytes flowing through the readers (uncompressed) we
> > see the following distribution:
>
> Please remember that Databricks is only one user of Parquet. Just
> because most of your data is BINARY doesn't mean this applies to
> Parquet data around the world.
>
> (for example, Parquet files for machine learning would obviously
> contain many numeric columns)
>
> > In addition to the above distribution we also know the average
> > compression ratio for integers with general compressors, which is about
> > 1.5x.
>
> Not only do we not know what the actual *average* would be on the entire
> corpus of Parquet files around the world, but an average over an
> unknown statistical distribution has very little information value.
>
> For example, if the average were to be 1.5x, but with an upper decile at
> 20x, then that upper decile would be worth optimizing for (a decile of
> Parquet files is certainly a huge amount of data).
>
> > My stance is that adding future encodings should be gated with a large
> > enough experiment on *real* data showing both efficacy and wide
> > applicability.
>
> Agreed, but which "real" data? :-)
>
> Regards
>
> Antoine.
>
>
>
