>
> I don't believe Apache has any restriction against Rust. We are not
> collectively beholden to any other organization's restrictions, are we?


It is correct that Apache does not have any restrictions.  The point is
mostly about:
1.  Even if there is no restriction, maintainers of Apache projects need to
maintain their own tool chains, and adding a new dependency might not be
something they care to take on (I am not actively involved in tool chain
maintenance, but the Arrow/Parquet C++ build system is particularly complex
due to the wide range of systems it targets).
2.  IMO it is important to consider downstream users in these decisions as
well.

Another thing that could help with adoption here is if pcodec had a
specification document (apologies if I missed it); that would allow others
to port it more easily.

Thanks,
Micah


On Fri, Jan 5, 2024 at 5:53 AM Martin Loncaric <m.w.lonca...@gmail.com>
wrote:

> I would make the comparison to byte_stream_split immediately, filtering
> down to only float columns, but it looks like that's the one encoding not
> supported by arrow-rs. Seeing if I can get this merged in:
> https://github.com/apache/arrow-rs/pull/4183.
>
> In the meantime I'll see if I can do a compression-ratio-only comparison
> using pyarrow or something.
>
> Micah:
>
> maintainers of parquet don't necessarily
> > have strong influence on all toolchain decisions their organizations may
> > make.
>
>
> I don't believe Apache has any restriction against Rust. We are not
> collectively beholden to any other organization's restrictions, are we?
>
> It does sound like a good idea for you to start publishing Maven packages
> > and other native language bindings to generally expand the reach of your
> > project.
>
>
> Totally agreed. My understanding is that the JVM and C++ implementations
> are most important to support, and other languages can follow (e.g. as they
> have for byte stream split, apparently). Rust<>C++ bindings aren't too hard
> since you only need to build for the target architecture. JNI and some
> others are trickier.
>
> On Fri, Jan 5, 2024 at 6:10 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hello,
> >
> > It would be very interesting to expand the comparison against
> > BYTE_STREAM_SPLIT + compression.
> >
> > See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal
> > to extend the range of types supporting BYTE_STREAM_SPLIT.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Wed, 3 Jan 2024 00:10:14 -0500
> > Martin Loncaric <m.w.lonca...@gmail.com>
> > wrote:
> > > I'd like to propose and get feedback on a new encoding for numerical
> > > columns: pco. I just did a blog post demonstrating how this would
> > > perform on various real-world datasets
> > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > TL;DR: pco losslessly achieves a much better compression ratio
> > > (44-158% higher) and slightly faster decompression speed than
> > > zstd-compressed Parquet. On the other hand, it compresses somewhat
> > > slower at the default compression level, but I think this difference
> > > may disappear in future updates.
> > >
> > > I think supporting this optional encoding would be an enormous win,
> > > but I'm not blind to the difficulties of implementing it:
> > > * Writing a good JVM implementation would be very difficult, so we'd
> > > probably have to make a JNI library.
> > > * Pco must be compressed one "chunk" (probably one per Parquet data
> > > page) at a time, with no way to estimate the encoded size until it
> > > has already done >50% of the compression work. I suspect the best
> > > solution is to split pco data pages based on unencoded size, which
> > > is different from existing encodings. I think this makes sense since
> > > pco fulfills the role usually played by compression in Parquet.
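The page-splitting idea described above can be sketched as follows. The names
`PAGE_BYTE_TARGET` and `split_into_pages` are hypothetical, not part of pco or
any Parquet implementation; the point is only that pages are cut by unencoded
byte size, since the encoded size isn't known until most of the compression
work is done:

```python
# Sketch: split a column's values into data pages by *unencoded* size.
# All names and the 1 MiB target are illustrative assumptions.
PAGE_BYTE_TARGET = 1024 * 1024  # e.g. aim for ~1 MiB of raw values per page

def split_into_pages(values, value_size=8, page_byte_target=PAGE_BYTE_TARGET):
    """Chunk `values` so each page holds at most `page_byte_target` raw bytes.

    `value_size` is the unencoded width of one value (8 for f64/i64).
    Each resulting page would then be handed to pco as one chunk.
    """
    per_page = max(1, page_byte_target // value_size)
    return [values[i:i + per_page] for i in range(0, len(values), per_page)]

pages = split_into_pages(list(range(300_000)))
# 1 MiB / 8 bytes = 131072 values per page, so 300k values -> 3 pages
print(len(pages), len(pages[0]))
```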
> > >
> > > Please let me know what you think of this idea.
> > >
> > > Thanks,
> > > Martin
> > >
