Pitch for Pcodec in Parquet (again)

Martin Loncaric Tue, 18 Mar 2025 07:14:20 -0700

Last year I sent an initial pitch for adding Pcodec (Pco for short) into
Parquet <https://lists.apache.org/thread/ht95wm8trfx2z4pq91t7170t2qjqg4yw>.
To re-introduce Pco: it's a codec for numerical sequences that almost
always gets a higher compression ratio than anything else (very often in the
20-100% improvement range), with speed on par with or slightly better than
Parquet's existing encodings paired with ZSTD (link to paper
<https://arxiv.org/abs/2502.06112>). It maps (array of numbers ->
bytes), as opposed to general-purpose codecs, which map (bytes -> bytes).
Since there are probably exabytes of numerical Parquet data out there, I
think the value of this compression ratio improvement is colossal.
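
As a toy illustration of the (array of numbers -> bytes) versus (bytes ->
bytes) distinction, the Python sketch below compares zlib applied to raw
little-endian int64 bytes with zlib applied after a simple delta step that
uses knowledge of the numbers. This is not Pco's algorithm, and zlib merely
stands in for a general-purpose codec; the function names and the synthetic
timestamp-like data are made up for the example.

```python
import random
import struct
import zlib


def bytes_to_bytes(raw):
    """General-purpose codec: sees only opaque bytes."""
    return zlib.compress(raw, 9)


def numbers_to_bytes(values):
    """Numeric-aware codec (toy): delta-encode the integers, then compress."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if values:
        # Keep the first value so the sequence can be rebuilt exactly.
        deltas.insert(0, values[0])
    raw = struct.pack(f"<{len(deltas)}q", *deltas)
    return zlib.compress(raw, 9)


def bytes_to_numbers(blob):
    """Inverse of numbers_to_bytes: decompress, then undo the deltas."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def make_demo_values(n=10_000, seed=0):
    """Synthetic millisecond timestamps: about 1000 ms apart with small jitter."""
    rng = random.Random(seed)
    return [1_700_000_000_000 + 1000 * i + rng.randint(0, 12) for i in range(n)]


if __name__ == "__main__":
    values = make_demo_values()
    plain = struct.pack(f"<{len(values)}q", *values)
    print("bytes -> bytes:  ", len(bytes_to_bytes(plain)))
    print("numbers -> bytes:", len(numbers_to_bytes(values)))
```

On data like this, the numeric-aware path should come out smaller because
the deltas are far more repetitive than the raw values; real gains depend
on the data and on the codec.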


Last year, there were some valid concerns. We've made major progress to
address most of them:

1. Newness. Pco had just been released at the time. Now it is used to store
petabytes of data in Zarr (a popular tensor format) and CnosDB (a time series
database), and someone is working on adding it to Postgres.
2. Limited FFI support, especially JVM. At the time, we did not have JVM
support, but now we do via JNI (io.github.pcodec/pco-jni). This is
currently compiled for the most common Linux, Mac, and Windows
architectures, and we can easily add more if needed. This is very similar
to how Zstd is used. I was told there's a rule that support in 3 languages
is required, and now we have Rust, Python, and JVM (and adding C wouldn't
be too hard).
3. Single maintainer. I now have another maintainer (@skielex) who is very
engaged, and a variety of other people have made code contributions in the
last year.

There was one other concern that can't be directly solved, but perhaps we
can avoid entirely:

4. Complexity. Parquet's existing encodings are very simple, and it sounds
like people would like to keep it that way. Pcodec's complexity is
somewhere in between that of an encoding and a compression codec (e.g. it
has 11k LoC, compared to Zstd's 70k). We could frame Pco as a Parquet
compression instead of an encoding if that alleviates concerns; the big
limitation is that it would only work on PLAIN encoding, since it needs the
original numbers. There are more details to work out, so we may need another
thread about this choice if we decide to move forward.
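
To make the PLAIN limitation concrete: a Parquet compression only ever sees
the bytes an encoding produced. For an INT64 column, PLAIN is just the
values as 8-byte little-endian integers, so the numbers can be recovered
from the page bytes; after dictionary or delta encoding, the bytes no longer
hold the values directly. A minimal Python sketch, with hypothetical helper
names that are not part of any Parquet or Pco API, and ignoring details such
as definition and repetition levels:

```python
import struct


def numbers_to_plain_int64_page(values):
    """PLAIN-encode int64 values: 8 bytes each, little-endian."""
    return struct.pack(f"<{len(values)}q", *values)


def plain_int64_page_to_numbers(page):
    """Recover the original int64 values from PLAIN-encoded bytes."""
    if len(page) % 8 != 0:
        raise ValueError("PLAIN INT64 data must be a multiple of 8 bytes")
    return list(struct.unpack(f"<{len(page) // 8}q", page))
```

A numeric codec sitting in the compression slot would need a step like
plain_int64_page_to_numbers (plus the column's physical type) before it
could do anything useful.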

Do people have more questions? What would the next step be?

Other links:
* format spec <https://github.com/pcodec/pcodec/blob/main/docs/format.md>
* old results with Pco hacked into Parquet
<https://graphallthethings.com/posts/the-parquet-we-could-have>
