Hi Micah,

I think point 2 is valid. Pcodec has a small but growing community of users. I'm not sure where the bar is for age or establishment, but it's reasonable if pcodec doesn't meet it yet.
I think point 1 is misguided. Not every organization uses Rust, but Rust programs are known to be much safer than C/C++. Rust compiles down to native dynamic libraries, so its JNI interop is identical to that of C/C++, and one does not need the Rust toolchain to use a jar built from Rust sources (see the sketch at the end of this message).

Thanks.

On Wed, Jan 3, 2024, 13:43 Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Martin,
> The results are impressive. However, I'll point you to a recent prior
> discussion on a proposed new encoding/compression technique
> <https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv> [1].
> While this seems to avoid the lossiness concerns raised there, that thread
> also suggests benchmarks to use for comparison.
>
> I think there are still two issues that apply here:
>
> 1. Requiring a Rust toolchain (apologies, but this seems to be Rust-only
> at the moment) and FFI for Java and other non-Rust implementations makes
> it, I think, much harder for other implementations to adopt this encoding.
> For instance, my organization does not currently allow Rust code in
> production.
> 2. This seems like something relatively new and not well established in
> the ecosystem, giving it a higher risk around ongoing support.
>
> Thanks,
> Micah
>
> [1] https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv
>
> On Tue, Jan 2, 2024 at 9:10 PM Martin Loncaric <m.w.lonca...@gmail.com>
> wrote:
>
> > I'd like to propose and get feedback on a new encoding for numerical
> > columns: pco. I just did a blog post demonstrating how this would perform
> > on various real-world datasets
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. TL;DR:
> > pco losslessly achieves a much better compression ratio (44-158% higher)
> > and slightly faster decompression speed than zstd-compressed Parquet. On
> > the other hand, it compresses somewhat slower at the default compression
> > level, but I think this difference may disappear in future updates.
> >
> > I think supporting this optional encoding would be an enormous win, but
> > I'm not blind to the difficulties of implementing it:
> > * Writing a good JVM implementation would be very difficult, so we'd
> > probably have to make a JNI library.
> > * Pco must be compressed one "chunk" (probably one per Parquet data page)
> > at a time, with no way to estimate the encoded size until it has already
> > done >50% of the compression work. I suspect the best solution is to
> > split pco data pages based on unencoded size, which is different from
> > existing encodings. I think this makes sense, since pco fulfills the role
> > usually played by compression in Parquet.
> >
> > Please let me know what you think of this idea.
> >
> > Thanks,
> > Martin
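P.S. To make the FFI point concrete, here is a rough sketch of what the Rust side of such a JNI binding could look like. Everything here is hypothetical and purely illustrative: the org.example.PcoCodec class/method names, the pass-through body, and the use of the jni crate are my assumptions, not an existing pcodec API.

// Hypothetical sketch of a Rust JNI entry point, assuming the `jni` crate's
// 0.21-style bindings and a crate built with crate-type = ["cdylib"].
// The Java side would declare something like:
//   package org.example;
//   public class PcoCodec { static native byte[] compress(byte[] page); }
use jni::objects::{JByteArray, JClass};
use jni::sys::jbyteArray;
use jni::JNIEnv;

#[no_mangle]
pub extern "system" fn Java_org_example_PcoCodec_compress<'local>(
    env: JNIEnv<'local>,
    _class: JClass<'local>,
    input: JByteArray<'local>,
) -> jbyteArray {
    // Copy the Java byte[] into Rust-owned memory.
    let bytes = env
        .convert_byte_array(&input)
        .expect("couldn't read input byte[]");

    // Placeholder: a real binding would call pco's compression routine here
    // instead of passing the bytes through unchanged.
    let compressed: Vec<u8> = bytes;

    // Hand the result back to the JVM as a new byte[].
    env.byte_array_from_slice(&compressed)
        .expect("couldn't allocate output byte[]")
        .into_raw()
}

Compiled as a cdylib, this is an ordinary native shared library (.so/.dylib/.dll) that can be bundled into the jar and loaded with System.loadLibrary, exactly as a C or C++ JNI library would be; downstream Java users only consume the prebuilt artifact, never the Rust toolchain.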