Hello,

It would be very interesting to expand the comparison against BYTE_STREAM_SPLIT + compression. See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal to extend the range of types supporting BYTE_STREAM_SPLIT.

Regards,
Antoine

On Wed, 3 Jan 2024 00:10:14 -0500, Martin Loncaric <m.w.lonca...@gmail.com> wrote:

> I'd like to propose and get feedback on a new encoding for numerical
> columns: pco. I just did a blog post demonstrating how this would perform
> on various real-world datasets:
> https://graphallthethings.com/posts/the-parquet-we-could-have
>
> TL;DR: pco losslessly achieves a much better compression ratio (44-158%
> higher) and slightly faster decompression speed than zstd-compressed
> Parquet. On the other hand, it compresses somewhat slower at the default
> compression level, but I think this difference may disappear in future
> updates.
>
> I think supporting this optional encoding would be an enormous win, but
> I'm not blind to the difficulties of implementing it:
> * Writing a good JVM implementation would be very difficult, so we'd
>   probably have to make a JNI library.
> * Pco must compress one "chunk" (probably one per Parquet data page) at a
>   time, with no way to estimate the encoded size until it has already
>   done >50% of the compression work. I suspect the best solution is to
>   split pco data pages based on unencoded size, which is different from
>   existing encodings. I think this makes sense, since pco fulfills the
>   role usually played by compression in Parquet.
>
> Please let me know what you think of this idea.
>
> Thanks,
> Martin
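For anyone curious what "splitting data pages based on unencoded size" might look like in practice, here is a minimal sketch. It is purely illustrative: `PAGE_BUDGET_BYTES` and `compress_chunk` are assumptions standing in for a writer-chosen page budget and a pco-style chunk compressor, not any real Parquet or pco API.

```python
import struct

# Assumed page budget; a real writer would make this configurable.
PAGE_BUDGET_BYTES = 1024 * 1024  # 1 MiB of *unencoded* data per page

def split_pages(values, value_size=8, budget=PAGE_BUDGET_BYTES):
    """Yield slices of `values` whose raw (unencoded) size fits the budget.

    This splits by unencoded size up front, because (as described above)
    the encoded size cannot be cheaply predicted before compressing.
    """
    per_page = max(1, budget // value_size)
    for start in range(0, len(values), per_page):
        yield values[start:start + per_page]

def write_column(values, compress_chunk):
    """`compress_chunk` is a hypothetical stand-in for a pco chunk compressor.

    Each page's float64 values are serialized to raw little-endian bytes,
    then compressed as one chunk per data page.
    """
    pages = []
    for page_values in split_pages(values):
        raw = struct.pack(f"<{len(page_values)}d", *page_values)
        pages.append(compress_chunk(raw))
    return pages
```

With 8-byte values and a 1 MiB budget, each page holds 131,072 values, so a 300,000-value column would produce three pages regardless of how well each page compresses.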