I would make the comparison to byte_stream_split immediately, filtering down to only the float columns, but it looks like it's the one encoding not supported by arrow-rs. Seeing if I can get this merged in: https://github.com/apache/arrow-rs/pull/4183.

In the meantime, I'll see if I can do a compression-ratio-only comparison using pyarrow or something.
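Roughly something like this (untested sketch; random data standing in for the real float columns, and on newer pyarrow versions the column_encoding argument is the equivalent knob):

    import os

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in float column; swap in a real dataset for actual numbers.
    table = pa.table({"x": np.random.randn(1_000_000)})

    # Baseline: PLAIN encoding + zstd.
    pq.write_table(table, "plain_zstd.parquet", compression="zstd")

    # BYTE_STREAM_SPLIT + zstd; dictionary encoding disabled so the
    # byte-stream-split encoding is actually used for the column.
    pq.write_table(
        table,
        "bss_zstd.parquet",
        compression="zstd",
        use_dictionary=False,
        use_byte_stream_split=["x"],
    )

    for path in ("plain_zstd.parquet", "bss_zstd.parquet"):
        print(path, os.path.getsize(path), "bytes")

That would only give file sizes, not encode/decode speed, but it should be enough for a first ratio comparison.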
Micah:

> maintainers of parquet don't necessarily have strong influence on all
> toolchain decisions their organizations may make.

I don't believe Apache has any restriction against Rust. We are not collectively beholden to any other organization's restrictions, are we?

> It does sound like a good idea for you to start publishing Maven packages
> and other native language bindings to generally expand the reach of your
> project.

Totally agreed. My understanding is that the JVM and C++ implementations are the most important to support, and other languages can follow (e.g. as they have for byte stream split, apparently). Rust<->C++ bindings aren't too hard, since you only need to build for the target architecture; JNI and some others are trickier.

On Fri, Jan 5, 2024 at 6:10 AM Antoine Pitrou <anto...@python.org> wrote:

> Hello,
>
> It would be very interesting to expand the comparison against
> BYTE_STREAM_SPLIT + compression.
>
> See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal
> to extend the range of types supporting BYTE_STREAM_SPLIT.
>
> Regards
>
> Antoine.
>
>
> On Wed, 3 Jan 2024 00:10:14 -0500
> Martin Loncaric <m.w.lonca...@gmail.com> wrote:
>
> > I'd like to propose and get feedback on a new encoding for numerical
> > columns: pco. I just did a blog post demonstrating how this would
> > perform on various real-world datasets
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > TL;DR: pco losslessly achieves a much better compression ratio
> > (44-158% higher) and slightly faster decompression speed than
> > zstd-compressed Parquet. On the other hand, it compresses somewhat
> > slower at the default compression level, but I think this difference
> > may disappear in future updates.
> >
> > I think supporting this optional encoding would be an enormous win,
> > but I'm not blind to the difficulties of implementing it:
> > * Writing a good JVM implementation would be very difficult, so we'd
> > probably have to make a JNI library.
> > * Pco must be compressed one "chunk" (probably one per Parquet data
> > page) at a time, with no way to estimate the encoded size until it
> > has already done >50% of the compression work. I suspect the best
> > solution is to split pco data pages based on unencoded size, which
> > is different from existing encodings. I think this makes sense,
> > since pco fulfills the role usually played by compression in
> > Parquet.
> >
> > Please let me know what you think of this idea.
> >
> > Thanks,
> > Martin