Hi Micah,

I think point 2 is valid. Pcodec has a small but growing community of users. I'm not sure where the bar is for age or establishment, but it's reasonable if pcodec doesn't meet it yet.
I think point 1 is misguided. Not every organization uses Rust, but Rust programs are known to be much safer than C/C++. Rust compiles down to native dynamic libraries, so its JNI interop is identical to that of C/C++, and one does not need the Rust toolchain to use a jar built from Rust sources (see the sketch at the end of this message).

Thanks.

On Wed, Jan 3, 2024, 13:43 Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Martin,
> The results are impressive. However, I'll point you to a recent prior
> discussion on a proposed new encoding/compression technique
> <https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv> [1].
> While this seems to avoid the lossiness concerns raised there, that thread
> also suggests benchmarks to use for comparison.
>
> I think there are still two issues that apply here:
>
> 1. Requiring a Rust toolchain (apologies, but this seems to be Rust-only
> at the moment) and FFI for Java and other non-Rust implementations makes
> it, I think, much harder for other implementations to adopt this encoding.
> For instance, my organization does not currently allow Rust code in
> production.
> 2. This seems like something relatively new and not well established in
> the ecosystem, giving it a higher risk around ongoing support.
>
> Thanks,
> Micah
>
> [1] https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv
>
> On Tue, Jan 2, 2024 at 9:10 PM Martin Loncaric <m.w.lonca...@gmail.com>
> wrote:
>
> > I'd like to propose and get feedback on a new encoding for numerical
> > columns: pco. I just did a blog post demonstrating how this would perform
> > on various real-world datasets
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. TL;DR:
> > pco losslessly achieves a much better compression ratio (44-158% higher)
> > and slightly faster decompression speed than zstd-compressed Parquet. On
> > the other hand, it compresses somewhat slower at the default compression
> > level, but I think this difference may disappear in future updates.
> >
> > I think supporting this optional encoding would be an enormous win, but
> > I'm not blind to the difficulties of implementing it:
> > * Writing a good JVM implementation would be very difficult, so we'd
> > probably have to make a JNI library.
> > * Pco must be compressed one "chunk" (probably one per Parquet data page)
> > at a time, with no way to estimate the encoded size until it has already
> > done >50% of the compression work. I suspect the best solution is to
> > split pco data pages based on unencoded size, which is different from
> > existing encodings. I think this makes sense, since pco fulfills the role
> > usually played by compression in Parquet.
> >
> > Please let me know what you think of this idea.
> >
> > Thanks,
> > Martin
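P.S. To make the FFI point concrete, here is a rough sketch of what the Rust side of such a JNI binding could look like. Everything here is hypothetical and purely illustrative: the org.example.PcoCodec class/method names, the pass-through body, and the use of the jni crate are my assumptions, not an existing pcodec API.

// Hypothetical sketch of a Rust JNI entry point, assuming the `jni` crate's
// 0.21-style bindings and a crate built with crate-type = ["cdylib"].
// The Java side would declare something like:
//   package org.example;
//   public class PcoCodec { static native byte[] compress(byte[] page); }
use jni::objects::{JByteArray, JClass};
use jni::sys::jbyteArray;
use jni::JNIEnv;

#[no_mangle]
pub extern "system" fn Java_org_example_PcoCodec_compress<'local>(
    env: JNIEnv<'local>,
    _class: JClass<'local>,
    input: JByteArray<'local>,
) -> jbyteArray {
    // Copy the Java byte[] into Rust-owned memory.
    let bytes = env
        .convert_byte_array(&input)
        .expect("couldn't read input byte[]");

    // Placeholder: a real binding would call pco's compression routine here
    // instead of passing the bytes through unchanged.
    let compressed: Vec<u8> = bytes;

    // Hand the result back to the JVM as a new byte[].
    env.byte_array_from_slice(&compressed)
        .expect("couldn't allocate output byte[]")
        .into_raw()
}

Compiled as a cdylib, this is an ordinary native shared library (.so/.dylib/.dll) that can be bundled into the jar and loaded with System.loadLibrary, exactly as a C or C++ JNI library would be; downstream Java users only consume the prebuilt artifact, never the Rust toolchain.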