I would make the comparison to byte_stream_split immediately, filtering down to only the float columns, but it looks like it's the one encoding not supported by arrow-rs. Seeing if I can get this merged in: https://github.com/apache/arrow-rs/pull/4183.

In the meantime, I'll see if I can do a compression-ratio-only comparison using pyarrow or something.
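Roughly something like this (untested sketch; random data standing in for the real float columns, and on newer pyarrow versions the column_encoding argument is the equivalent knob):

    import os

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in float column; swap in a real dataset for actual numbers.
    table = pa.table({"x": np.random.randn(1_000_000)})

    # Baseline: PLAIN encoding + zstd.
    pq.write_table(table, "plain_zstd.parquet", compression="zstd")

    # BYTE_STREAM_SPLIT + zstd; dictionary encoding disabled so the
    # byte-stream-split encoding is actually used for the column.
    pq.write_table(
        table,
        "bss_zstd.parquet",
        compression="zstd",
        use_dictionary=False,
        use_byte_stream_split=["x"],
    )

    for path in ("plain_zstd.parquet", "bss_zstd.parquet"):
        print(path, os.path.getsize(path), "bytes")

That would only give file sizes, not encode/decode speed, but it should be enough for a first ratio comparison.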
Micah:

> maintainers of parquet don't necessarily have strong influence on all
> toolchain decisions their organizations may make.

I don't believe Apache has any restriction against Rust. We are not collectively beholden to any other organization's restrictions, are we?

> It does sound like a good idea for you to start publishing Maven packages
> and other native language bindings to generally expand the reach of your
> project.

Totally agreed. My understanding is that the JVM and C++ implementations are the most important to support, and other languages can follow (e.g. as they have for byte stream split, apparently). Rust<->C++ bindings aren't too hard, since you only need to build for the target architecture; JNI and some others are trickier.

On Fri, Jan 5, 2024 at 6:10 AM Antoine Pitrou <anto...@python.org> wrote:

> Hello,
>
> It would be very interesting to expand the comparison against
> BYTE_STREAM_SPLIT + compression.
>
> See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal
> to extend the range of types supporting BYTE_STREAM_SPLIT.
>
> Regards
>
> Antoine.
>
>
> On Wed, 3 Jan 2024 00:10:14 -0500
> Martin Loncaric <m.w.lonca...@gmail.com> wrote:
>
> > I'd like to propose and get feedback on a new encoding for numerical
> > columns: pco. I just did a blog post demonstrating how this would
> > perform on various real-world datasets
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > TL;DR: pco losslessly achieves a much better compression ratio
> > (44-158% higher) and slightly faster decompression speed than
> > zstd-compressed Parquet. On the other hand, it compresses somewhat
> > slower at the default compression level, but I think this difference
> > may disappear in future updates.
> >
> > I think supporting this optional encoding would be an enormous win,
> > but I'm not blind to the difficulties of implementing it:
> > * Writing a good JVM implementation would be very difficult, so we'd
> > probably have to make a JNI library.
> > * Pco must be compressed one "chunk" (probably one per Parquet data
> > page) at a time, with no way to estimate the encoded size until it
> > has already done >50% of the compression work. I suspect the best
> > solution is to split pco data pages based on unencoded size, which
> > is different from existing encodings. I think this makes sense,
> > since pco fulfills the role usually played by compression in
> > Parquet.
> >
> > Please let me know what you think of this idea.
> >
> > Thanks,
> > Martin