Hello,

It would be very interesting to expand the comparison against BYTE_STREAM_SPLIT + compression. See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal to extend the range of types supporting BYTE_STREAM_SPLIT.

Regards,
Antoine

On Wed, 3 Jan 2024 00:10:14 -0500, Martin Loncaric <m.w.lonca...@gmail.com> wrote:

> I'd like to propose and get feedback on a new encoding for numerical
> columns: pco. I just did a blog post demonstrating how this would perform
> on various real-world datasets:
> https://graphallthethings.com/posts/the-parquet-we-could-have
>
> TL;DR: pco losslessly achieves a much better compression ratio (44-158%
> higher) and slightly faster decompression speed than zstd-compressed
> Parquet. On the other hand, it compresses somewhat slower at the default
> compression level, but I think this difference may disappear in future
> updates.
>
> I think supporting this optional encoding would be an enormous win, but
> I'm not blind to the difficulties of implementing it:
> * Writing a good JVM implementation would be very difficult, so we'd
>   probably have to make a JNI library.
> * Pco must compress one "chunk" (probably one per Parquet data page) at a
>   time, with no way to estimate the encoded size until it has already
>   done >50% of the compression work. I suspect the best solution is to
>   split pco data pages based on unencoded size, which is different from
>   existing encodings. I think this makes sense, since pco fulfills the
>   role usually played by compression in Parquet.
>
> Please let me know what you think of this idea.
>
> Thanks,
> Martin
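For anyone curious what "splitting data pages based on unencoded size" might look like in practice, here is a minimal sketch. It is purely illustrative: `PAGE_BUDGET_BYTES` and `compress_chunk` are assumptions standing in for a writer-chosen page budget and a pco-style chunk compressor, not any real Parquet or pco API.

```python
import struct

# Assumed page budget; a real writer would make this configurable.
PAGE_BUDGET_BYTES = 1024 * 1024  # 1 MiB of *unencoded* data per page

def split_pages(values, value_size=8, budget=PAGE_BUDGET_BYTES):
    """Yield slices of `values` whose raw (unencoded) size fits the budget.

    This splits by unencoded size up front, because (as described above)
    the encoded size cannot be cheaply predicted before compressing.
    """
    per_page = max(1, budget // value_size)
    for start in range(0, len(values), per_page):
        yield values[start:start + per_page]

def write_column(values, compress_chunk):
    """`compress_chunk` is a hypothetical stand-in for a pco chunk compressor.

    Each page's float64 values are serialized to raw little-endian bytes,
    then compressed as one chunk per data page.
    """
    pages = []
    for page_values in split_pages(values):
        raw = struct.pack(f"<{len(page_values)}d", *page_values)
        pages.append(compress_chunk(raw))
    return pages
```

With 8-byte values and a 1 MiB budget, each page holds 131,072 values, so a 300,000-value column would produce three pages regardless of how well each page compresses.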