> It would be very interesting to expand the comparison against
> BYTE_STREAM_SPLIT + compression.
Antoine: I created one now, at the bottom of the post
<https://graphallthethings.com/posts/the-parquet-we-could-have>. In this
case, BYTE_STREAM_SPLIT did worse.

> parquet-mr is currently a pure Java library.

Jan: I think that's a misconception; it depends on zstd JNI at least
<https://github.com/apache/parquet-mr/blob/8418b8b60ba8990f476e2d1c07b23aeb9614652a/pom.xml#L104>.

> Another thing that could help with adoption here is if pcodec had a
> specification document

Micah and Will: I have a format specification diagram
<https://github.com/mwlon/pcodec#file-format>, but I'll write up the
details more exactly somewhere else.

On Fri, Jan 5, 2024 at 2:47 PM Will Jones <will.jones...@gmail.com> wrote:

> > Another thing that could help with adoption here is if pcodec had a
> > specification document (apologies if I missed it), that would allow
> > others to more easily port it.
>
> +1 to this. The encodings right now are described by a spec, rather
> than some specific library. I think if we wanted Pcodec to be
> integrated into Parquet, it should be as a specification for an
> encoding, not as a library. If some Parquet implementation wanted to
> use your implementation of Pcodec, that should be a separate decision
> made by individual Parquet implementations. Do you have the
> specification for the codec written down somewhere?
>
> For some added context, there are many Parquet libraries even outside
> of the Apache governance. For example, both Velox and DuckDB have their
> own C++ implementations of Parquet, independent of the Apache C++ one.
>
> On Fri, Jan 5, 2024 at 9:26 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > > I don't believe Apache has any restriction against Rust. We are not
> > > collectively beholden to any other organization's restrictions, are
> > > we?
> >
> > It is correct that Apache does not have any restrictions. The point
> > is mostly about:
> >
> > 1. Even if there is no restriction, maintainers of Apache projects
> >    need to maintain their own tool chains; adding a new dependency
> >    might not be something they care to take on (I am not actively
> >    involved in tool chain maintenance, but the Arrow/Parquet C++
> >    build system is particularly complex due to the wide range of
> >    systems it targets).
> > 2. IMO it is important to consider downstream users in these
> >    decisions as well.
> >
> > Another thing that could help with adoption here is if pcodec had a
> > specification document (apologies if I missed it), that would allow
> > others to more easily port it.
> >
> > Thanks,
> > Micah
> >
> > On Fri, Jan 5, 2024 at 5:53 AM Martin Loncaric
> > <m.w.lonca...@gmail.com> wrote:
> >
> > > I would make the comparison to byte_stream_split immediately,
> > > filtering down to only float columns, but it looks like it's the
> > > one encoding not supported by arrow-rs. Seeing if I can get this
> > > merged in: https://github.com/apache/arrow-rs/pull/4183.
> > >
> > > In the meantime, I'll see if I can do a compression-ratio-only
> > > comparison using pyarrow or something.
> > >
> > > Micah:
> > >
> > > > maintainers of parquet don't necessarily have strong influence on
> > > > all toolchain decisions their organizations may make.
> > >
> > > I don't believe Apache has any restriction against Rust. We are not
> > > collectively beholden to any other organization's restrictions, are
> > > we?
> > >
> > > > It does sound like a good idea for you to start publishing Maven
> > > > packages and other native language bindings to generally expand
> > > > the reach of your project.
> > >
> > > Totally agreed. My understanding is that the JVM and C++
> > > implementations are the most important to support, and other
> > > languages can follow (e.g. as they have for byte stream split,
> > > apparently). Rust<>C++ bindings aren't too hard, since you only
> > > need to build for the target architecture.
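[The compression-ratio-only comparison Martin mentions (BYTE_STREAM_SPLIT
plus compression vs. plain compression) can be sketched without any
Parquet library at all: apply the byte split by hand and compare
compressed sizes. A minimal stdlib-only sketch, using zlib as a stand-in
for zstd; the data is synthetic and the point is the splitting logic,
not absolute numbers:]

```python
import random
import struct
import zlib

def byte_stream_split(values):
    # BYTE_STREAM_SPLIT per the Parquet spec: scatter the k-th byte of
    # every 8-byte double into the k-th of 8 streams, then concatenate
    # the streams.
    raw = b"".join(struct.pack("<d", v) for v in values)
    return b"".join(raw[k::8] for k in range(8))

random.seed(0)
# Smooth doubles: near-constant sign/exponent bytes, the case where
# BYTE_STREAM_SPLIT tends to help the downstream compressor.
values = [20.0 + 0.001 * i + random.gauss(0, 0.01) for i in range(10_000)]

plain = b"".join(struct.pack("<d", v) for v in values)
split = byte_stream_split(values)

# zlib stands in for zstd here just to keep the sketch dependency-free.
print("plain:", len(zlib.compress(plain, 6)))
print("split:", len(zlib.compress(split, 6)))
```

[On data like this the split layout compresses noticeably better, since
the constant exponent bytes end up in long uniform runs; on the float
columns in the blog post's benchmark, per Martin's reply above, it
did worse than pco.]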
> > > JNI and some others are trickier.
> > >
> > > On Fri, Jan 5, 2024 at 6:10 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > It would be very interesting to expand the comparison against
> > > > BYTE_STREAM_SPLIT + compression.
> > > >
> > > > See https://issues.apache.org/jira/browse/PARQUET-2414 for a
> > > > proposal to extend the range of types supporting
> > > > BYTE_STREAM_SPLIT.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > On Wed, 3 Jan 2024 00:10:14 -0500
> > > > Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > > >
> > > > > I'd like to propose and get feedback on a new encoding for
> > > > > numerical columns: pco. I just did a blog post demonstrating
> > > > > how this would perform on various real-world datasets
> > > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > > > TL;DR: pco losslessly achieves a much better compression ratio
> > > > > (44-158% higher) and slightly faster decompression speed than
> > > > > zstd-compressed Parquet. On the other hand, it compresses
> > > > > somewhat slower at the default compression level, but I think
> > > > > this difference may disappear in future updates.
> > > > >
> > > > > I think supporting this optional encoding would be an enormous
> > > > > win, but I'm not blind to the difficulties of implementing it:
> > > > >
> > > > > * Writing a good JVM implementation would be very difficult, so
> > > > >   we'd probably have to make a JNI library.
> > > > > * Pco must be compressed one "chunk" (probably one per Parquet
> > > > >   data page) at a time, with no way to estimate the encoded
> > > > >   size until it has already done >50% of the compression work.
> > > > >   I suspect the best solution is to split pco data pages based
> > > > >   on unencoded size, which is different from existing
> > > > >   encodings. I think this makes sense, since pco fulfills the
> > > > >   role usually played by compression in Parquet.
> > > > >
> > > > > Please let me know what you think of this idea.
> > > > >
> > > > > Thanks,
> > > > > Martin
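[The page-splitting idea in Martin's second bullet, cutting data pages by
*unencoded* rather than encoded size, since a chunk-based codec cannot
cheaply predict its output size, can be sketched in a few lines. This is
a hypothetical writer loop, not pco's actual API; zlib stands in for the
chunk compressor, and the 64 KiB threshold and helper names are
illustrative choices:]

```python
import struct
import zlib

PAGE_UNENCODED_BYTES = 64 * 1024  # cut pages by *unencoded* size
VALUE_SIZE = 8                    # each value is an 8-byte double

def compress_chunk(values):
    # Stand-in for pco's per-chunk compressor; a real writer would call
    # the pcodec library here instead of zlib.
    return zlib.compress(b"".join(struct.pack("<d", v) for v in values))

def write_pages(values):
    # Fix the number of values per page up front from the unencoded
    # budget, so we never need the encoded size to decide where to cut.
    per_page = PAGE_UNENCODED_BYTES // VALUE_SIZE
    return [
        compress_chunk(values[start:start + per_page])
        for start in range(0, len(values), per_page)
    ]

pages = write_pages([float(i) for i in range(20_000)])
print(len(pages))  # 20_000 * 8 bytes / 64 KiB -> 3 pages
```

[Existing Parquet encodings instead target an encoded page size, which
works because their output size is predictable; the scheme above trades
that for uniform unencoded pages, matching Martin's observation that pco
plays the role compression normally does.]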