Last year I sent an initial pitch for adding Pcodec (Pco for short) to Parquet <https://lists.apache.org/thread/ht95wm8trfx2z4pq91t7170t2qjqg4yw>. To re-introduce Pco: it's a codec for numerical sequences that almost always gets a higher compression ratio than anything else (very often in the 20-100% improvement range), with speed on par with or slightly better than Parquet's existing encodings paired with Zstd (link to paper <https://arxiv.org/abs/2502.06112>). It goes from (array of numbers -> bytes), as opposed to general-purpose codecs, which are (bytes -> bytes). Since there are probably exabytes of numerical Parquet data out there, I think the value of this compression ratio improvement is colossal.
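To make the (array of numbers -> bytes) versus (bytes -> bytes) distinction concrete, here is a toy sketch in Python. It is not Pco's algorithm or API: the function names are made up for illustration, and delta encoding followed by zlib stands in for a numeric codec. The only point is that a codec which sees typed numbers can exploit structure that a byte-oriented codec has to rediscover from raw bytes.

```python
import struct
import zlib
from typing import List


def bytes_codec_compress(raw: bytes) -> bytes:
    """General-purpose codec: bytes -> bytes, with no knowledge of the data type."""
    return zlib.compress(raw, 9)


def numeric_codec_compress(nums: List[int]) -> bytes:
    """Toy numeric codec (hypothetical, NOT Pco): int64 values -> bytes.

    Knowing the values are integers lets us delta-encode before entropy coding.
    """
    deltas = []
    prev = 0
    for n in nums:
        deltas.append(n - prev)
        prev = n
    raw = struct.pack(f"<{len(deltas)}q", *deltas)
    return zlib.compress(raw, 9)


def numeric_codec_decompress(blob: bytes) -> List[int]:
    """Inverse of numeric_codec_compress: bytes -> int64 values."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    out = []
    acc = 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out


# A smooth integer sequence, e.g. evenly spaced timestamps.
nums = [1_000_000 + 37 * i for i in range(10_000)]
plain_bytes = struct.pack(f"<{len(nums)}q", *nums)  # what a bytes codec would be handed

generic_size = len(bytes_codec_compress(plain_bytes))
typed_blob = numeric_codec_compress(nums)
typed_size = len(typed_blob)

print(f"bytes -> bytes:   {generic_size} bytes")
print(f"numbers -> bytes: {typed_size} bytes")
```

This is also why a numeric codec needs access to the original values rather than bytes that have already been transformed by some other step.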
Last year, there were some valid concerns. We've made major progress on most of them:

1. Newness. Pco had just been released at the time. It is now used to store petabytes of data in Zarr (a popular tensor format) and CnosDB (a time series database), and someone is working on adding it to Postgres.
2. Limited FFI support, especially JVM. At the time, we did not have JVM support, but now we do via JNI (io.github.pcodec/pco-jni). This is currently compiled for the most common Linux, Mac, and Windows architectures, and we can easily add more if needed. This is very similar to how Zstd is used. I was told there's a rule that support in 3 languages is required, and now we have Rust, Python, and JVM (and adding C wouldn't be too hard).
3. Single maintainer. I now have another maintainer (@skielex) who is very engaged, and a variety of other people have made code contributions in the last year.

There was one other concern that can't be directly solved, but that perhaps we can avoid entirely:

4. Complexity. Parquet's existing encodings are very simple, and it sounds like people would like to keep it that way. Pcodec's complexity is somewhere between that of an encoding and that of a compression codec (e.g. it has 11k LoC, compared to Zstd's 70k). We could frame Pco as a Parquet compression instead of an encoding if that alleviates concerns; the big limitation is that it would only work with PLAIN encoding, since it needs the original numbers. There are more details, so we may need another thread about this choice if we decide to move forward.

Do people have more questions? What would the next step be?

Other links:
* format spec <https://github.com/pcodec/pcodec/blob/main/docs/format.md>
* old results with Pco hacked into Parquet <https://graphallthethings.com/posts/the-parquet-we-could-have>