Out of curiosity, how does this compare with ALP in terms of size and speed?
On Tue, Mar 18, 2025 at 7:14 AM Martin Loncaric <mlonca...@janestreet.com.invalid> wrote:

> Last year I sent an initial pitch for adding Pcodec (Pco for short) into
> Parquet <https://lists.apache.org/thread/ht95wm8trfx2z4pq91t7170t2qjqg4yw>.
>
> To re-introduce Pco: it's a codec for numerical sequences that almost
> always gets a higher compression ratio than anything else (very often in
> the 20-100% improvement range) and has performance on par with, or
> slightly faster than, Parquet's existing encodings paired with ZSTD (link
> to paper <https://arxiv.org/abs/2502.06112>). It goes from (array of
> numbers -> bytes), as opposed to general-purpose codecs, which are (bytes
> -> bytes). Since there are probably exabytes of numerical Parquet data out
> there, I think the value of this compression ratio improvement is
> colossal.
>
> Last year, there were some valid concerns. We've made major progress
> toward addressing most of them:
>
> 1. Newness. Pco had just been released at the time. Now it is used to
> store petabytes of data in Zarr (a popular tensor format) and CnosDB (a
> time series database), and someone is working on adding it to Postgres.
>
> 2. Limited FFI support, especially JVM. At the time, we did not have JVM
> support, but now we do via JNI (io.github.pcodec/pco-jni). This is
> currently compiled for the most common Linux, Mac, and Windows
> architectures, and we can easily add more if needed. This is very similar
> to how Zstd is used. I was told there's a rule that support in 3 languages
> is required, and we now have Rust, Python, and JVM (and adding C wouldn't
> be too hard).
>
> 3. Single maintainer. I now have another maintainer (@skielex) who is
> very engaged, and a variety of other people have made code contributions
> in the last year.
>
> There was one other concern that can't be directly solved, but perhaps we
> can avoid it entirely:
>
> 4. Complexity. Parquet's existing encodings are very simple, and it
> sounds like people would like to keep it that way. Pcodec's complexity is
> somewhere between that of an encoding and a compression (e.g. it has 11k
> LoC, compared to Zstd's 70k). We could frame Pco as a Parquet compression
> instead of an encoding if that alleviates concerns; the big limitation is
> that it would only work on the PLAIN encoding, since it needs the original
> numbers. There are more details, so we may need another thread about this
> choice if we decide to move forward.
>
> Do people have more questions? What would the next step be?
>
> Other links:
> * format spec <https://github.com/pcodec/pcodec/blob/main/docs/format.md>
> * old results with Pco hacked into Parquet
> <https://graphallthethings.com/posts/the-parquet-we-could-have>
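To make the (array of numbers -> bytes) vs. (bytes -> bytes) distinction above concrete, here is a minimal sketch. It is NOT Pco's actual algorithm — it uses delta encoding plus zlib as a crude stand-in for a value-aware codec, and plain zlib on the raw buffer as the general-purpose baseline — but it shows why a codec that sees the numbers (not just opaque bytes) can compress numerical sequences much better:

```python
import zlib
import numpy as np

# A smooth numerical sequence, like a monotone time-series column.
rng = np.random.default_rng(0)
nums = np.cumsum(rng.integers(0, 100, size=10_000)).astype(np.int64)
raw = nums.tobytes()

# General-purpose route (bytes -> bytes): the codec sees opaque bytes.
general = zlib.compress(raw, 9)

# Value-aware route (array of numbers -> bytes): a crude stand-in here,
# delta-encoding before zlib. The deltas are small (0-99), so most bytes
# of each int64 are zero and compress extremely well.
deltas = np.diff(nums, prepend=0)
value_aware = zlib.compress(deltas.tobytes(), 9)

# Decompression round-trips: undo zlib, then undo the delta encoding.
restored = np.cumsum(np.frombuffer(zlib.decompress(value_aware), dtype=np.int64))
assert np.array_equal(restored, nums)

print(f"raw: {len(raw)}  general: {len(general)}  value-aware: {len(value_aware)}")
```

On data like this the value-aware route lands at a fraction of the general-purpose size; Pco's real format layers much more on top (bin-based entropy coding, automatic mode detection), which is where the 20-100% improvements in the paper come from.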