Out of curiosity, how does this compare with ALP in terms of size and speed?
On Tue, Mar 18, 2025 at 7:14 AM Martin Loncaric <mlonca...@janestreet.com.invalid> wrote:

> Last year I sent an initial pitch for adding Pcodec (Pco for short) into
> Parquet <https://lists.apache.org/thread/ht95wm8trfx2z4pq91t7170t2qjqg4yw>.
>
> To re-introduce Pco: it's a codec for numerical sequences that almost
> always gets a higher compression ratio than anything else (very often in
> the 20-100% improvement range) and has performance on par with, or
> slightly faster than, Parquet's existing encodings paired with ZSTD (link
> to paper <https://arxiv.org/abs/2502.06112>). It goes from (array of
> numbers -> bytes), as opposed to general-purpose codecs, which are (bytes
> -> bytes). Since there are probably exabytes of numerical Parquet data out
> there, I think the value of this compression ratio improvement is
> colossal.
>
> Last year, there were some valid concerns. We've made major progress
> toward addressing most of them:
>
> 1. Newness. Pco had just been released at the time. Now it is used to
> store petabytes of data in Zarr (a popular tensor format) and CnosDB (a
> time series database), and someone is working on adding it to Postgres.
>
> 2. Limited FFI support, especially JVM. At the time, we did not have JVM
> support, but now we do via JNI (io.github.pcodec/pco-jni). This is
> currently compiled for the most common Linux, Mac, and Windows
> architectures, and we can easily add more if needed. This is very similar
> to how Zstd is used. I was told there's a rule that support in 3 languages
> is required, and we now have Rust, Python, and JVM (and adding C wouldn't
> be too hard).
>
> 3. Single maintainer. I now have another maintainer (@skielex) who is
> very engaged, and a variety of other people have made code contributions
> in the last year.
>
> There was one other concern that can't be directly solved, but perhaps we
> can avoid it entirely:
>
> 4. Complexity. Parquet's existing encodings are very simple, and it
> sounds like people would like to keep it that way. Pcodec's complexity is
> somewhere between that of an encoding and a compression (e.g. it has 11k
> LoC, compared to Zstd's 70k). We could frame Pco as a Parquet compression
> instead of an encoding if that alleviates concerns; the big limitation is
> that it would only work on the PLAIN encoding, since it needs the original
> numbers. There are more details, so we may need another thread about this
> choice if we decide to move forward.
>
> Do people have more questions? What would the next step be?
>
> Other links:
> * format spec <https://github.com/pcodec/pcodec/blob/main/docs/format.md>
> * old results with Pco hacked into Parquet
> <https://graphallthethings.com/posts/the-parquet-we-could-have>
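To make the (array of numbers -> bytes) vs. (bytes -> bytes) distinction above concrete, here is a minimal sketch. It is NOT Pco's actual algorithm — it uses delta encoding plus zlib as a crude stand-in for a value-aware codec, and plain zlib on the raw buffer as the general-purpose baseline — but it shows why a codec that sees the numbers (not just opaque bytes) can compress numerical sequences much better:

```python
import zlib
import numpy as np

# A smooth numerical sequence, like a monotone time-series column.
rng = np.random.default_rng(0)
nums = np.cumsum(rng.integers(0, 100, size=10_000)).astype(np.int64)
raw = nums.tobytes()

# General-purpose route (bytes -> bytes): the codec sees opaque bytes.
general = zlib.compress(raw, 9)

# Value-aware route (array of numbers -> bytes): a crude stand-in here,
# delta-encoding before zlib. The deltas are small (0-99), so most bytes
# of each int64 are zero and compress extremely well.
deltas = np.diff(nums, prepend=0)
value_aware = zlib.compress(deltas.tobytes(), 9)

# Decompression round-trips: undo zlib, then undo the delta encoding.
restored = np.cumsum(np.frombuffer(zlib.decompress(value_aware), dtype=np.int64))
assert np.array_equal(restored, nums)

print(f"raw: {len(raw)}  general: {len(general)}  value-aware: {len(value_aware)}")
```

On data like this the value-aware route lands at a fraction of the general-purpose size; Pco's real format layers much more on top (bin-based entropy coding, automatic mode detection), which is where the 20-100% improvements in the paper come from.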