I'd like to propose and get feedback on a new encoding for numerical
columns: pco. I just published a blog post demonstrating how it performs
on various real-world datasets
<https://graphallthethings.com/posts/the-parquet-we-could-have>. TL;DR: pco
losslessly achieves a much better compression ratio (44-158% higher) and
slightly faster decompression than zstd-compressed Parquet. On the other
hand, it compresses somewhat slower at its default compression level,
though I think this gap may close in future updates.
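To make the encoding concrete, here's a minimal lossless round-trip sketch
using the pco Rust crate's standalone API (simpler_compress /
simple_decompress). Exact names and signatures may drift across versions,
so treat this as illustrative rather than definitive:

```rust
use pco::errors::PcoResult;
use pco::standalone::{simple_decompress, simpler_compress};
use pco::DEFAULT_COMPRESSION_LEVEL;

fn main() -> PcoResult<()> {
    // A numeric column, e.g. timestamps with a regular stride.
    let nums: Vec<i64> = (0..100_000i64).map(|i| 1_000_000 + 3 * i).collect();

    // Compress the column as a single pco chunk at the default level.
    let compressed = simpler_compress(&nums, DEFAULT_COMPRESSION_LEVEL)?;

    // Decompress and verify the round trip is lossless.
    let recovered: Vec<i64> = simple_decompress(&compressed)?;
    assert_eq!(recovered, nums);

    println!("{} values -> {} bytes", nums.len(), compressed.len());
    Ok(())
}
```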

I think supporting this optional encoding would be an enormous win, but I'm
not blind to the difficulties of implementing it:
* Writing a good JVM implementation would be very difficult, so we'd
probably have to make a JNI library (the first sketch after this list
shows roughly what the Rust side of such a shim could look like).
* Pco compresses one "chunk" (probably one per Parquet data page) at a
time, and there is no way to estimate the encoded size until it has
already done >50% of the compression work. I suspect the best solution is
to split pco data pages based on unencoded size, unlike existing
encodings. I think this makes sense since pco fulfills the role usually
played by compression in Parquet (see the second sketch after this list).
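For the first bullet, this is roughly the shape the Rust side of a JNI
shim could take, using the jni crate (0.21-style API). The class and
method name (org.apache.parquet.pco.PcoNative.decompressLongs) are
hypothetical placeholders, and a real shim would throw a Java exception
rather than panic:

```rust
// Hypothetical JNI entry point exposing pco decompression of an i64 column
// to the JVM. Assumes the `jni` crate (0.21-style API) and pco's standalone API.
use jni::objects::{JByteArray, JClass, JLongArray};
use jni::JNIEnv;
use pco::standalone::simple_decompress;

#[no_mangle]
pub extern "system" fn Java_org_apache_parquet_pco_PcoNative_decompressLongs<'local>(
    mut env: JNIEnv<'local>,
    _class: JClass<'local>,
    input: JByteArray<'local>,
) -> JLongArray<'local> {
    // Copy the compressed page bytes out of the JVM heap.
    let bytes = env.convert_byte_array(&input).expect("read byte[]");
    // Decompress the whole pco chunk; a real shim would throw on error.
    let nums: Vec<i64> = simple_decompress(&bytes).expect("pco decompress");
    // Copy the decoded values back into a fresh long[].
    let out = env.new_long_array(nums.len() as i32).expect("alloc long[]");
    env.set_long_array_region(&out, 0, &nums).expect("copy long[]");
    out
}
```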
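For the second bullet, here's a sketch of the writer-side rule I have in
mind: buffer raw values and cut a page whenever the unencoded byte size
crosses a threshold. The 1 MiB target and all names here are made up for
illustration:

```rust
use pco::errors::PcoResult;
use pco::standalone::simpler_compress;
use pco::DEFAULT_COMPRESSION_LEVEL;

/// Assumed target for raw (unencoded) bytes per data page.
const UNENCODED_PAGE_BYTES: usize = 1 << 20;

struct PcoPageWriter {
    buffer: Vec<i64>,    // raw values awaiting compression
    pages: Vec<Vec<u8>>, // one finished pco chunk per data page
}

impl PcoPageWriter {
    fn new() -> Self {
        Self { buffer: Vec::new(), pages: Vec::new() }
    }

    fn write(&mut self, value: i64) -> PcoResult<()> {
        self.buffer.push(value);
        // Split on unencoded size: 8 bytes per i64, known without compressing.
        if self.buffer.len() * std::mem::size_of::<i64>() >= UNENCODED_PAGE_BYTES {
            self.flush()?;
        }
        Ok(())
    }

    fn flush(&mut self) -> PcoResult<()> {
        if !self.buffer.is_empty() {
            // Compress the buffered values as one pco chunk == one data page.
            self.pages
                .push(simpler_compress(&self.buffer, DEFAULT_COMPRESSION_LEVEL)?);
            self.buffer.clear();
        }
        Ok(())
    }
}
```

The key point is that the flush decision depends only on the unencoded
size, so the writer never has to run most of the compression work just to
decide where a page boundary falls.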

Please let me know what you think of this idea.

Thanks,
Martin
