*Curt:*

> Out of curiosity, how does this compare with ALP in terms of size and
> speed?

I hadn't seen ALP before - it looks extremely new. From skimming their
repo, I think you can expect Pco's compression ratio to be much higher, but
its speed to be much slower. ALP is probably better for in-memory data
transfers, whereas Pco would be better for data read over the network or
stored on disk. ALP also seems to apply only to floats.

*Alkis:*

> From our internal numbers (Databricks) very little data in parquet is
> numbers...

Strings typically compress much better than numbers (dictionary encoding
benefits them a lot more), so this distribution changes a lot when looking
at compressed data. Also, Databricks caters to a particular sort of Parquet
user. In the financial industry, for instance, almost all Parquet data
(even uncompressed) is numerical.
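
As a toy illustration of the dictionary point (purely illustrative - this is
not Parquet's actual implementation): dictionary encoding replaces each
repeated string with a small integer index, so a repetitive string column
collapses into a tiny dictionary plus a highly compressible index stream.

    use std::collections::HashMap;

    // Toy dictionary encoding: map each distinct string to a small integer
    // index. The column becomes a short dictionary plus an index stream.
    fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
        let mut dict: Vec<String> = Vec::new();
        let mut seen: HashMap<String, u32> = HashMap::new();
        let mut indices = Vec::with_capacity(values.len());
        for &v in values {
            let idx = *seen.entry(v.to_string()).or_insert_with(|| {
                dict.push(v.to_string());
                (dict.len() - 1) as u32
            });
            indices.push(idx);
        }
        (dict, indices)
    }

    fn main() {
        let column = ["US", "US", "DE", "US", "DE", "FR"];
        let (dict, indices) = dictionary_encode(&column);
        assert_eq!(dict, vec!["US", "DE", "FR"]);
        assert_eq!(indices, vec![0, 0, 1, 0, 1, 2]);
    }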

> is the additional complexity of pco justified to optimize ~3% of data out
> there...

Even with the uncompressed Databricks accounting above, that's ~7% of the
data: summing the int and float types plus FIXED_LEN_BYTE_ARRAY (which is
often float16) gives 1.95 + 1.94 + 1.29 + 1.24 + 1.24 + 0.07 ≈ 7.7%.

> it looks like most integer data is not amenable to compression by general
> compression schemes...

If Pco improves compression ratio from 1.5 -> 1.8, that's even more
valuable than improving compression ratio from 5 -> 6, because the absolute
byte savings are larger on poorly-compressible data.
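
To make that concrete: per GB of raw data, going from ratio 1.5 to 1.8
shrinks the stored bytes from ~667 MB to ~556 MB (~111 MB saved), whereas
going from 5 to 6 only shrinks them from 200 MB to ~167 MB (~33 MB saved).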

> This raises the concern that the data used to benchmark pco is rather
> niche...

We compare against 6 datasets in the paper. If you're skeptical, then I
encourage you to try it on some data:

    cargo install pco_cli
    pcodec bench -i /path/to/parquet -c pco,parquet:compression=zstd1 --iters 1
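
If you'd rather call the library than the CLI, here's a minimal round-trip
sketch against the pco Rust crate. The standalone-module names below
(simpler_compress, simple_decompress, DEFAULT_COMPRESSION_LEVEL) match
recent crate versions, but treat them as an assumption and check the docs
for your version:

    use pco::standalone::{simple_decompress, simpler_compress};
    use pco::DEFAULT_COMPRESSION_LEVEL;

    fn main() {
        // A smooth i64 sequence - the kind of numeric column Pco targets.
        let nums: Vec<i64> = (0..100_000).map(|i| 1_000_000 + i * 3).collect();

        // Compress at the default level (API names assumed from recent
        // pco versions).
        let compressed =
            simpler_compress(&nums, DEFAULT_COMPRESSION_LEVEL).unwrap();
        println!("compressed to {} bytes", compressed.len());

        // Decompress and verify the round trip is lossless.
        let recovered = simple_decompress::<i64>(&compressed).unwrap();
        assert_eq!(recovered, nums);
    }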

On Tue, Mar 18, 2025 at 10:56 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> As Curt said, how it compares with ALP is the obvious question - both in
> terms of compression/decompression speed and compression ratio?
>
> From our internal numbers (Databricks) very little data in parquet is
> numbers. In terms of bytes flowing through the readers (uncompressed) we
> see the following distribution:
> BINARY               92.26%
> INT96                 1.95%
> INT64                 1.94%
> INT32                 1.29%
> DOUBLE                1.24%
> FIXED_LEN_BYTE_ARRAY  1.24%
> FLOAT                 0.07%
> BOOLEAN               0.02%
>
> Given the above, is the additional complexity of pco justified to optimize
> ~3% of data out there (assuming what we observe is a representative sample
> of the world's data)?
>
> In addition to the above distribution, we also know the average compression
> ratio for integers with general compressors, which is about 1.5x. Granted,
> most parquet files are snappy-compressed, with zstd gaining traction, but it
> looks like most integer data is not amenable to compression by general
> compression schemes. This does not align with the data used in the
> experiments for pco posted here:
> https://graphallthethings.com/posts/the-parquet-we-could-have. This raises
> the concern that the data used to benchmark pco is rather niche, which means
> that pco might apply to even less than the ~3% of total parquet data in the
> distribution above. This would make the case for pulling pco into the spec
> even harder.
>
> My stance is that adding future encodings should be gated by a large enough
> experiment on *real* data showing both efficacy and wide applicability.
>
> On Tue, Mar 18, 2025 at 3:22 PM Curt Hagenlocher <c...@hagenlocher.org>
> wrote:
>
> > Out of curiosity, how does this compare with ALP in terms of size and
> > speed?
> >
> > On Tue, Mar 18, 2025 at 7:14 AM Martin Loncaric
> > <mlonca...@janestreet.com.invalid> wrote:
> >
> > > Last year I sent an initial pitch for adding Pcodec (Pco for short) into
> > > Parquet
> > > <https://lists.apache.org/thread/ht95wm8trfx2z4pq91t7170t2qjqg4yw>.
> > > To re-introduce Pco: it's a codec for numerical sequences that almost
> > > always gets a higher compression ratio than anything else (very often in
> > > the 20-100% improvement range) and has performance on par with or
> > > slightly faster than Parquet's existing encodings paired with ZSTD (link
> > > to paper <https://arxiv.org/abs/2502.06112>). It goes from (array of
> > > numbers -> bytes), as opposed to general-purpose codecs, which are
> > > (bytes -> bytes). Since there are probably exabytes of numerical Parquet
> > > data out there, I think the value of this compression ratio improvement
> > > is colossal.
> > >
> > > Last year, there were some valid concerns. We've made major progress to
> > > address most of them:
> > >
> > > 1. Newness. Pco had just been released at the time. Now it is used to
> > > store petabytes of data in Zarr (a popular tensor format) and CnosDB (a
> > > time series database), and someone is working on adding it into
> > > Postgres.
> > > 2. Limited FFI support, especially JVM. At the time, we did not have JVM
> > > support, but now we do via JNI (io.github.pcodec/pco-jni). This is
> > > currently compiled for the most common Linux, Mac, and Windows
> > > architectures, and we can easily add more if needed. This is very
> > > similar to how Zstd is used. I was told there's a rule that support in 3
> > > languages is required, and now we have Rust, Python, and JVM (and adding
> > > C wouldn't be too hard).
> > > 3. Single maintainer. I now have another maintainer (@skielex) who is
> > > very engaged, and a variety of other people have made code contributions
> > > in the last year.
> > >
> > > There was one other concern that can't be directly solved, but perhaps
> > > we can avoid it entirely:
> > >
> > > 4. Complexity. Parquet's existing encodings are very simple, and it
> > > sounds like people would like to keep it that way. Pcodec's complexity
> > > is somewhere in between that of an encoding and a compression (e.g. it
> > > has 11k LoC, compared to Zstd's 70k). We could frame Pco as a Parquet
> > > compression instead of an encoding if that alleviates concerns; the big
> > > limitation is that it would then only work on the PLAIN encoding, since
> > > it needs the original numbers. There are more details, so we may need
> > > another thread about this choice if we decide to move forward.
> > >
> > > Do people have more questions? What would the next step be?
> > >
> > > Other links:
> > > * format spec
> > > <https://github.com/pcodec/pcodec/blob/main/docs/format.md>
> > > * old results with Pco hacked into Parquet
> > > <https://graphallthethings.com/posts/the-parquet-we-could-have>
> >
>
