>
> Not every organization uses Rust, but Rust
> programs are known to be much safer than C/C++.


I agree, but I think this is irrelevant to the point about building from
source (expanded on below), and Parquet maintainers don't necessarily
have strong influence over the toolchain decisions their organizations
make.


> Rust compiles down to
> native dynamic libraries, so the interop with JNI is identical to that of
> C/C++. One does not need the Rust toolchain to use a jar built with Rust
> sources.


I agree Rust can be packaged into libraries for consumers, but that still
requires packaging work, and someone still needs to build it from source.
Java is not the only implementation of Parquet; several others exist (a
non-exhaustive list: Parquet C++ hosted in Arrow, Impala, Golang, etc.),
and in some cases these are compiled from source for packaging purposes.
I'm not an expert in all of them, but presumably each would take some
amount of effort if source builds are required.
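
To make the interop point concrete: a Rust crate built with crate-type =
["cdylib"] exposes a plain C ABI, which is exactly what JNI (or cgo, or a
C++ consumer) calls into. A minimal sketch of what such an export might
look like (the names and signatures here are hypothetical, not pco's
actual API):

// Illustrative only: the kind of C-ABI entry point a Rust cdylib could
// export. Names and signatures are hypothetical, not pco's actual API.
// Built with crate-type = ["cdylib"] in Cargo.toml, the resulting
// .so/.dylib/.dll loads like any C library (e.g. via System.loadLibrary
// and a thin JNI wrapper on the Java side).
#[no_mangle]
pub extern "C" fn pco_compress_chunk(
    input: *const u8,
    input_len: usize,
    output: *mut u8,
    output_cap: usize,
) -> isize {
    // Safety: the caller guarantees both pointers are valid for the
    // lengths given.
    let src = unsafe { std::slice::from_raw_parts(input, input_len) };
    let dst = unsafe { std::slice::from_raw_parts_mut(output, output_cap) };
    // Placeholder: a real binding would invoke the pco encoder here.
    let n = src.len().min(dst.len());
    dst[..n].copy_from_slice(&src[..n]);
    n as isize // bytes written, or a negative error code
}

The effort I'm referring to isn't writing that shim; it's building,
testing, and publishing the native artifact for every platform each
binding supports, which each non-Rust implementation would have to take
on.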

It does sound like a good idea for you to start publishing Maven
packages and bindings for other languages to expand the reach of your
project.

Thanks,
Micah



On Wed, Jan 3, 2024 at 11:15 AM Martin Loncaric <m.w.lonca...@gmail.com>
wrote:

> Hi Micah,
>
> I think point 2 is valid. Pcodec has a small but growing community of
> users. I'm not sure where the bar is for age or establishment, but it's
> reasonable if pcodec doesn't meet it yet.
>
> I think point 1 is misguided. Not every organization uses Rust, but Rust
> programs are known to be much safer than C/C++. Rust compiles down to
> native dynamic libraries, so the interop with JNI is identical to that of
> C/C++. One does not need the Rust toolchain to use a jar built with Rust
> sources.
>
> Thanks.
>
> On Wed, Jan 3, 2024, 13:43 Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Hi Martin,
> > The results are impressive. However, I'll point you to a recent prior
> > discussion on a proposed new encoding/compression technique
> > <https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv> [1].
> > While this seems to avoid the lossiness concerns, there are also
> > suggested benchmarks to use for comparison.
> >
> > I think there are still two issues that apply here:
> >
> > 1.  Requiring a Rust toolchain (apologies, but this seems to be
> > Rust-only at the moment) and FFI for Java and other non-Rust
> > implementations I think makes it much harder for other implementations
> > to adopt this encoding. For instance, my organization does not
> > currently allow Rust code in production.
> > 2.  This seems like something relatively new and not well established
> > in the ecosystem, giving it a higher risk around ongoing support.
> >
> > Thanks,
> > Micah
> >
> >
> > [1] https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv
> >
> > On Tue, Jan 2, 2024 at 9:10 PM Martin Loncaric <m.w.lonca...@gmail.com>
> > wrote:
> >
> > > I'd like to propose and get feedback on a new encoding for numerical
> > > columns: pco. I just did a blog post demonstrating how this would
> > > perform on various real-world datasets
> > > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > > TL;DR: pco losslessly achieves a much better compression ratio
> > > (44-158% higher) and slightly faster decompression speed than
> > > zstd-compressed Parquet. On the other hand, it compresses somewhat
> > > slower at the default compression level, but I think this difference
> > > may disappear in future updates.
> > >
> > > I think supporting this optional encoding would be an enormous win,
> > > but I'm not blind to the difficulties of implementing it:
> > > * Writing a good JVM implementation would be very difficult, so we'd
> > > probably have to make a JNI library.
> > > * Pco must be compressed one "chunk" (probably one per Parquet data
> > > page) at a time, with no way to estimate the encoded size until it
> > > has already done >50% of the compression work. I suspect the best
> > > solution is to split pco data pages based on unencoded size, which is
> > > different from existing encodings. I think this makes sense since pco
> > > fulfills the role usually played by compression in Parquet.
> > >
> > > Please let me know what you think of this idea.
> > >
> > > Thanks,
> > > Martin
> > >
> >
>
