As Curt said, the obvious question is how this compares with ALP, both in
terms of compression/decompression speed and compression ratio.

From our internal numbers (Databricks), very little data in Parquet is
numeric. In terms of bytes flowing through the readers (uncompressed), we
see the following distribution:
BINARY               92.26%
INT96                 1.95%
INT64                 1.94%
INT32                 1.29%
DOUBLE                1.24%
FIXED_LEN_BYTE_ARRAY  1.24%
FLOAT                 0.07%
BOOLEAN               0.02%

Given the above, is the additional complexity of pco justified to optimize
~3% of the data out there (assuming what we observe is a representative
sample of the world's data)?

In addition to the above distribution, we also know the average compression
ratio for integers under general-purpose compressors, which is about 1.5x.
Granted, most Parquet files are Snappy-compressed, with zstd gaining
traction, but it looks like most integer data is not amenable to compression
by general-purpose schemes. This does not align with the data used in the
pco experiments posted at
https://graphallthethings.com/posts/the-parquet-we-could-have, which raises
the concern that the benchmark data is rather niche; if so, pco might apply
to even less than the ~3% of total Parquet data in the distribution above.
That would make the case for pulling pco into the spec even harder.
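
For anyone who wants to sanity-check that ~1.5x figure on their own
columns, here is a minimal sketch of the measurement (assuming the numpy
and zstandard PyPI packages; zstd stands in for any general-purpose
bytes -> bytes compressor, and the jittered-timestamp column is invented
purely for illustration, so the ratio it prints says nothing about real
data):

    import numpy as np
    import zstandard as zstd  # pip install zstandard

    # Invented integer column: millisecond timestamps with random jitter.
    rng = np.random.default_rng(0)
    col = (np.arange(1_000_000, dtype=np.int64) * 1_000
           + rng.integers(0, 500, size=1_000_000))

    # A general-purpose compressor sees only bytes, not numbers.
    raw = col.tobytes()
    compressed = zstd.ZstdCompressor(level=3).compress(raw)
    print(f"compression ratio: {len(raw) / len(compressed):.2f}x")

Swapping in Snappy (e.g. via python-snappy) would match the more common
default mentioned above.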

My stance is that adding future encodings should be gated on a sufficiently
large experiment with *real* data showing both efficacy and wide
applicability.

On Tue, Mar 18, 2025 at 3:22 PM Curt Hagenlocher <c...@hagenlocher.org>
wrote:

> Out of curiosity, how does this compare with ALP in terms of size and
> speed?
>
> On Tue, Mar 18, 2025 at 7:14 AM Martin Loncaric
> <mlonca...@janestreet.com.invalid> wrote:
>
> > Last year I sent an initial pitch for adding Pcodec (Pco for short) into
> > Parquet <https://lists.apache.org/thread/ht95wm8trfx2z4pq91t7170t2qjqg4yw>.
> > To re-introduce Pco: it's a codec for numerical sequences that almost
> > always gets higher compression ratio than anything else (very often in the
> > 20-100% improvement range) and has performance on par with or slightly
> > faster than Parquet's existing encodings paired with ZSTD (link to paper
> > <https://arxiv.org/abs/2502.06112>). It goes from (array of numbers ->
> > bytes), as opposed to general-purpose codecs which are (bytes -> bytes).
> > Since there are probably exabytes of numerical Parquet data out there, I
> > think the value of this compression ratio improvement is colossal.
> >
> > Last year, there were some valid concerns. We've made major progress to
> > address most of them:
> >
> > 1. Newness. Pco had just been released at the time. Now it is used to
> > store petabytes of data in Zarr (a popular tensor format), CnosDB (a time
> > series database), and someone is working on adding it into Postgres.
> > 2. Limited FFI support, especially JVM. At the time, we did not have JVM
> > support, but now we do via JNI (io.github.pcodec/pco-jni). This is
> > currently compiled for the most common Linux, Mac, and Windows
> > architectures, and we can easily add more if needed. This is very similar
> > to how Zstd is used. I was told there's a rule that support in 3 languages
> > is required, and now we have Rust, Python, and JVM (and adding C wouldn't
> > be too hard).
> > 3. Single maintainer. I now have another maintainer (@skielex) who is very
> > engaged, and a variety of other people have made code contributions in the
> > last year.
> >
> > There was one other concern that can't be directly solved, but perhaps we
> > can avoid entirely:
> >
> > 4. Complexity. Parquet's existing encodings are very simple, and it sounds
> > like people would like to keep it that way. Pcodec's complexity is
> > somewhere in between that of an encoding and a compression (e.g. it has 11k
> > LoC, compared to Zstd's 70k). We could frame Pco as a Parquet compression
> > instead of encoding if that alleviates concerns; the big limitation is that
> > it would only work on PLAIN encoding, since it needs the original numbers.
> > There are more details, so we may need another thread about this choice if
> > we decide to move forward.
> >
> > Do people have more questions? What would the next step be?
> >
> > Other links:
> > * format spec <https://github.com/pcodec/pcodec/blob/main/docs/format.md>
> > * old results with Pco hacked into Parquet
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>
> >
>
