> It would be very interesting to expand the comparison against
> BYTE_STREAM_SPLIT + compression.
Antoine: I created one now, at the bottom of the post
<https://graphallthethings.com/posts/the-parquet-we-could-have>. In this
case, BYTE_STREAM_SPLIT did worse.

> parquet-mr is currently a pure Java library.

Jan: I think that's a misconception; it depends on zstd JNI at least
<https://github.com/apache/parquet-mr/blob/8418b8b60ba8990f476e2d1c07b23aeb9614652a/pom.xml#L104>.

> Another thing that could help with adoption here is if pcodec had a
> specification document

Micah and Will: I have a format specification diagram
<https://github.com/mwlon/pcodec#file-format>, but I'll write up the
details more exactly somewhere else.

On Fri, Jan 5, 2024 at 2:47 PM Will Jones <will.jones...@gmail.com> wrote:

> > Another thing that could help with adoption here is if pcodec had a
> > specification document (apologies if I missed it), that would allow
> > others to more easily port it.
>
> +1 to this. The encodings right now are described by a spec, rather
> than some specific library. I think if we wanted Pcodec to be
> integrated into Parquet, it should be as a specification for an
> encoding, not as a library. If some Parquet implementation wanted to
> use your implementation of Pcodec, that should be a separate decision
> made by individual Parquet implementations. Do you have the
> specification for the codec written down somewhere?
>
> For some added context, there are many Parquet libraries even outside
> of the Apache governance. For example, both Velox and DuckDB have their
> own C++ implementations of Parquet, independent of the Apache C++ one.
>
> On Fri, Jan 5, 2024 at 9:26 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > > I don't believe Apache has any restriction against Rust. We are not
> > > collectively beholden to any other organization's restrictions, are
> > > we?
> >
> > It is correct that Apache does not have any restrictions. The point
> > is mostly about:
> >
> > 1. Even if there is no restriction, maintainers of Apache projects
> >    need to maintain their own tool chains; adding a new dependency
> >    might not be something they care to take on (I am not actively
> >    involved in tool chain maintenance, but the Arrow/Parquet C++
> >    build system is particularly complex due to the wide range of
> >    systems it targets).
> > 2. IMO it is important to consider downstream users in these
> >    decisions as well.
> >
> > Another thing that could help with adoption here is if pcodec had a
> > specification document (apologies if I missed it), that would allow
> > others to more easily port it.
> >
> > Thanks,
> > Micah
> >
> > On Fri, Jan 5, 2024 at 5:53 AM Martin Loncaric
> > <m.w.lonca...@gmail.com> wrote:
> >
> > > I would make the comparison to byte_stream_split immediately,
> > > filtering down to only float columns, but it looks like it's the
> > > one encoding not supported by arrow-rs. Seeing if I can get this
> > > merged in: https://github.com/apache/arrow-rs/pull/4183.
> > >
> > > In the meantime, I'll see if I can do a compression-ratio-only
> > > comparison using pyarrow or something.
> > >
> > > Micah:
> > >
> > > > maintainers of parquet don't necessarily have strong influence on
> > > > all toolchain decisions their organizations may make.
> > >
> > > I don't believe Apache has any restriction against Rust. We are not
> > > collectively beholden to any other organization's restrictions, are
> > > we?
> > >
> > > > It does sound like a good idea for you to start publishing Maven
> > > > packages and other native language bindings to generally expand
> > > > the reach of your project.
> > >
> > > Totally agreed. My understanding is that the JVM and C++
> > > implementations are the most important to support, and other
> > > languages can follow (e.g. as they have for byte stream split,
> > > apparently). Rust<>C++ bindings aren't too hard, since you only
> > > need to build for the target architecture.
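[The compression-ratio-only comparison Martin mentions (BYTE_STREAM_SPLIT
plus compression vs. plain compression) can be sketched without any
Parquet library at all: apply the byte split by hand and compare
compressed sizes. A minimal stdlib-only sketch, using zlib as a stand-in
for zstd; the data is synthetic and the point is the splitting logic,
not absolute numbers:]

```python
import random
import struct
import zlib

def byte_stream_split(values):
    # BYTE_STREAM_SPLIT per the Parquet spec: scatter the k-th byte of
    # every 8-byte double into the k-th of 8 streams, then concatenate
    # the streams.
    raw = b"".join(struct.pack("<d", v) for v in values)
    return b"".join(raw[k::8] for k in range(8))

random.seed(0)
# Smooth doubles: near-constant sign/exponent bytes, the case where
# BYTE_STREAM_SPLIT tends to help the downstream compressor.
values = [20.0 + 0.001 * i + random.gauss(0, 0.01) for i in range(10_000)]

plain = b"".join(struct.pack("<d", v) for v in values)
split = byte_stream_split(values)

# zlib stands in for zstd here just to keep the sketch dependency-free.
print("plain:", len(zlib.compress(plain, 6)))
print("split:", len(zlib.compress(split, 6)))
```

[On data like this the split layout compresses noticeably better, since
the constant exponent bytes end up in long uniform runs; on the float
columns in the blog post's benchmark, per Martin's reply above, it
did worse than pco.]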
> > > JNI and some others are trickier.
> > >
> > > On Fri, Jan 5, 2024 at 6:10 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > It would be very interesting to expand the comparison against
> > > > BYTE_STREAM_SPLIT + compression.
> > > >
> > > > See https://issues.apache.org/jira/browse/PARQUET-2414 for a
> > > > proposal to extend the range of types supporting
> > > > BYTE_STREAM_SPLIT.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > On Wed, 3 Jan 2024 00:10:14 -0500
> > > > Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > > >
> > > > > I'd like to propose and get feedback on a new encoding for
> > > > > numerical columns: pco. I just did a blog post demonstrating
> > > > > how this would perform on various real-world datasets
> > > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > > TL;DR: pco losslessly achieves a much better compression ratio
> > > > > (44-158% higher) and slightly faster decompression speed than
> > > > > zstd-compressed Parquet. On the other hand, it compresses
> > > > > somewhat slower at the default compression level, but I think
> > > > > this difference may disappear in future updates.
> > > > >
> > > > > I think supporting this optional encoding would be an enormous
> > > > > win, but I'm not blind to the difficulties of implementing it:
> > > > >
> > > > > * Writing a good JVM implementation would be very difficult, so
> > > > >   we'd probably have to make a JNI library.
> > > > > * Pco must be compressed one "chunk" (probably one per Parquet
> > > > >   data page) at a time, with no way to estimate the encoded
> > > > >   size until it has already done >50% of the compression work.
> > > > >   I suspect the best solution is to split pco data pages based
> > > > >   on unencoded size, which is different from existing
> > > > >   encodings. I think this makes sense, since pco fulfills the
> > > > >   role usually played by compression in Parquet.
> > > > >
> > > > > Please let me know what you think of this idea.
> > > > >
> > > > > Thanks,
> > > > > Martin
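[The page-splitting idea in Martin's second bullet, cutting data pages by
*unencoded* rather than encoded size, since a chunk-based codec cannot
cheaply predict its output size, can be sketched in a few lines. This is
a hypothetical writer loop, not pco's actual API; zlib stands in for the
chunk compressor, and the 64 KiB threshold and helper names are
illustrative choices:]

```python
import struct
import zlib

PAGE_UNENCODED_BYTES = 64 * 1024  # cut pages by *unencoded* size
VALUE_SIZE = 8                    # each value is an 8-byte double

def compress_chunk(values):
    # Stand-in for pco's per-chunk compressor; a real writer would call
    # the pcodec library here instead of zlib.
    return zlib.compress(b"".join(struct.pack("<d", v) for v in values))

def write_pages(values):
    # Fix the number of values per page up front from the unencoded
    # budget, so we never need the encoded size to decide where to cut.
    per_page = PAGE_UNENCODED_BYTES // VALUE_SIZE
    return [
        compress_chunk(values[start:start + per_page])
        for start in range(0, len(values), per_page)
    ]

pages = write_pages([float(i) for i in range(20_000)])
print(len(pages))  # 20_000 * 8 bytes / 64 KiB -> 3 pages
```

[Existing Parquet encodings instead target an encoded page size, which
works because their output size is predictable; the scheme above trades
that for uniform unencoded pages, matching Martin's observation that pco
plays the role compression normally does.]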