*Curt:*

> Out of curiosity, how does this compare with ALP in terms of size and
> speed?
I hadn't seen ALP before - it looks extremely new. From skimming their repo, I'd expect Pco's compression ratio to be much higher, but its speed to be much slower. ALP is probably better for in-memory data transfers, whereas Pco would be better for data read over the network or stored on disk. ALP also seems to apply only to floats.

*Alkis:*

> From our internal numbers (Databricks) very little data in parquet is
> numbers...

Strings typically compress much better than numbers (dictionary encoding benefits them a lot more), so this distribution changes a lot when looking at compressed data. Also, Databricks caters to a particular sort of Parquet user. In the financial industry, for instance, almost all Parquet data (even uncompressed) is numerical.

> is the additional complexity of pco justified to optimize ~3% of data out
> there...

Even with the uncompressed Databricks accounting above, that's ~7% of the data: the int and float types sum to about 6.5%, and FIXED_LEN_BYTE_ARRAY (which is often float16) adds another 1.2%.

> it looks like most integer data is not amenable to compression by general
> compression schemes...

That makes Pco more valuable, not less. Improving compression ratio from 1.5 to 1.8 saves more bytes than improving it from 5 to 6: per byte of raw data, the former saves 1/1.5 - 1/1.8 ≈ 0.11 bytes, while the latter saves only 1/5 - 1/6 ≈ 0.03.

> This raises the concern that the data used to benchmark pco is rather
> niche

We compare against 6 datasets in the paper. If you're skeptical, I encourage you to try it on some of your own data (cargo install pco_cli; pcodec bench -i /path/to/parquet -c pco,parquet:compression=zstd1 --iters 1).
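Spelled out, those commands are (a minimal sketch: /path/to/parquet is a placeholder for a real Parquet file on your machine, and the install step assumes a Rust toolchain):

    # install the benchmarking CLI from crates.io
    cargo install pco_cli

    # benchmark Pco against Parquet with zstd level 1 on your own data,
    # running one iteration of each
    pcodec bench -i /path/to/parquet -c pco,parquet:compression=zstd1 --iters 1

This should report compressed size and speed for each codec, so you can check the claims on your own workload rather than ours.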
On Tue, Mar 18, 2025 at 10:56 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> As Curt said, how does it compare with ALP is the obvious question - both
> in terms of compression/decompression speeds and compression ratio?
>
> From our internal numbers (Databricks) very little data in parquet is
> numbers. In terms of bytes flowing through the readers (uncompressed) we
> see the following distribution:
> BINARY 92.26%
> INT96 1.95%
> INT64 1.94%
> INT32 1.29%
> DOUBLE 1.24%
> FIXED_LEN_BYTE_ARRAY 1.24%
> FLOAT 0.07%
> BOOLEAN 0.02%
>
> Given the above, is the additional complexity of pco justified to optimize
> ~3% of data out there (assume what we observe is a representative sample
> of the worlds data)?
>
> In addition to the above distribution we also know the average compression
> ratio for integers with general compressors which is about 1.5x. Granted
> most parquet files are snappy compressed with zstd gaining traction but it
> looks like most integer data is not amenable to compression by general
> compression schemes. This does not align with the data used in the
> experiments for pco posted here
> https://graphallthethings.com/posts/the-parquet-we-could-have. This raises
> the concern that the data used to benchmark pco is rather niche, which
> means that pco might even apply to much less than the 3% of total parquet
> data of the distribution above. This would make the case of pulling pco in
> the spec even harder.
>
> My stance is that adding future encodings should be gated with a large
> enough experiment on *real* data showing both efficacy and wide
> applicability.
>
> On Tue, Mar 18, 2025 at 3:22 PM Curt Hagenlocher <c...@hagenlocher.org>
> wrote:
>
> > Out of curiosity, how does this compare with ALP in terms of size and
> > speed?
> >
> > On Tue, Mar 18, 2025 at 7:14 AM Martin Loncaric
> > <mlonca...@janestreet.com.invalid> wrote:
> >
> > > Last year I sent an initial pitch for adding Pcodec (Pco for short)
> > > into Parquet
> > > <https://lists.apache.org/thread/ht95wm8trfx2z4pq91t7170t2qjqg4yw>.
> > > To re-introduce Pco: it's a codec for numerical sequences that almost
> > > always gets higher compression ratio than anything else (very often
> > > in the 20-100% improvement range) and has performance on par with or
> > > slightly faster than Parquet's existing encodings paired with ZSTD
> > > (link to paper <https://arxiv.org/abs/2502.06112>). It goes from
> > > (array of numbers -> bytes), as opposed to general-purpose codecs
> > > which are (bytes -> bytes). Since there are probably exabytes of
> > > numerical Parquet data out there, I think the value of this
> > > compression ratio improvement is colossal.
> > >
> > > Last year, there were some valid concerns. We've made major progress
> > > to address most of them:
> > >
> > > 1. Newness. Pco had just been released at the time. Now it is used to
> > > store petabytes of data in Zarr (a popular tensor format), CnosDB (a
> > > time series database), and someone is working on adding it into
> > > Postgres.
> > > 2. Limited FFI support, especially JVM. At the time, we did not have
> > > JVM support, but now we do via JNI (io.github.pcodec/pco-jni). This
> > > is currently compiled for the most common Linux, Mac, and Windows
> > > architectures, and we can easily add more if needed. This is very
> > > similar to how Zstd is used. I was told there's a rule that support
> > > in 3 languages is required, and now we have Rust, Python, and JVM
> > > (and adding C wouldn't be too hard).
> > > 3. Single maintainer. I now have another maintainer (@skielex) who is
> > > very engaged, and a variety of other people have made code
> > > contributions in the last year.
> > >
> > > There was one other concern that can't be directly solved, but
> > > perhaps we can avoid entirely:
> > >
> > > 4. Complexity. Parquet's existing encodings are very simple, and it
> > > sounds like people would like to keep it that way. Pcodec's
> > > complexity is somewhere in between that of an encoding and a
> > > compression (e.g. it has 11k LoC, compared to Zstd's 70k). We could
> > > frame Pco as a Parquet compression instead of encoding if that
> > > alleviates concerns; the big limitation is that it would only work on
> > > PLAIN encoding, since it needs the original numbers. There are more
> > > details, so we may need another thread about this choice if we decide
> > > to move forward.
> > >
> > > Do people have more questions? What would the next step be?
> > >
> > > Other links:
> > > * format spec
> > > <https://github.com/pcodec/pcodec/blob/main/docs/format.md>
> > > * old results with Pco hacked into Parquet
> > > <https://graphallthethings.com/posts/the-parquet-we-could-have>