I'd like to propose and get feedback on a new encoding for numerical columns: pco. I just wrote a blog post demonstrating how it would perform on various real-world datasets: <https://graphallthethings.com/posts/the-parquet-we-could-have>. TL;DR: pco losslessly achieves a much better compression ratio (44-158% higher) and slightly faster decompression than zstd-compressed Parquet. On the other hand, it compresses somewhat more slowly at the default compression level, though I think this difference may disappear in future updates.
I think supporting this optional encoding would be an enormous win, but I'm not blind to the difficulties of implementing it:

* Writing a good JVM implementation would be very difficult, so we'd probably have to wrap the native library via JNI.
* Pco must compress one "chunk" (probably one per Parquet data page) at a time, with no way to estimate the encoded size until more than 50% of the compression work has already been done. I suspect the best solution is to split pco data pages based on unencoded size, unlike existing encodings (see the rough sketch in the P.S. below). I think this makes sense, since pco fills the role usually played by compression in Parquet.

Please let me know what you think of this idea.

Thanks,
Martin
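P.S. To make the page-splitting idea concrete, below is a rough Rust sketch of a writer that flushes one pco chunk per data page once the buffered, unencoded bytes cross a target size. Treat the details as assumptions: it uses the standalone `simpler_compress` API from the Rust `pco` crate (exact names may vary by version, and a real Parquet integration would presumably use pco's wrapped format instead), and the 1 MiB target is just an illustrative default, not a tuned value.

```rust
use pco::errors::PcoResult;
use pco::standalone::simpler_compress;
use pco::DEFAULT_COMPRESSION_LEVEL;

/// Target *unencoded* page size; an illustrative default, not a tuned value.
const TARGET_UNENCODED_PAGE_BYTES: usize = 1 << 20; // 1 MiB

/// Buffers raw values and flushes one pco chunk per data page whenever the
/// unencoded size crosses the target. The page boundary has to be decided
/// from the unencoded size because the encoded size can't be estimated
/// cheaply before most of the compression work has run.
struct PcoPageWriter {
    buffer: Vec<i64>,    // values not yet encoded
    pages: Vec<Vec<u8>>, // finished (encoded) data pages
}

impl PcoPageWriter {
    fn new() -> Self {
        Self { buffer: Vec::new(), pages: Vec::new() }
    }

    fn write(&mut self, value: i64) -> PcoResult<()> {
        self.buffer.push(value);
        // Split on unencoded size: value count times value width.
        if self.buffer.len() * std::mem::size_of::<i64>() >= TARGET_UNENCODED_PAGE_BYTES {
            self.flush()?;
        }
        Ok(())
    }

    fn flush(&mut self) -> PcoResult<()> {
        if !self.buffer.is_empty() {
            // One pco chunk per Parquet data page.
            let page = simpler_compress(&self.buffer, DEFAULT_COMPRESSION_LEVEL)?;
            self.pages.push(page);
            self.buffer.clear();
        }
        Ok(())
    }
}

fn main() -> PcoResult<()> {
    let mut writer = PcoPageWriter::new();
    for i in 0..1_000_000_i64 {
        writer.write(i)?;
    }
    writer.flush()?; // flush the final partial page
    println!("wrote {} pages", writer.pages.len());
    Ok(())
}
```

The only real point here is the flush condition: it looks at the buffered unencoded bytes rather than an encoded-size estimate, which is what makes this splitting strategy different from existing encodings.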