> Pco could technically work as a Parquet encoding, but people are wary of
> its newness and weak FFI support. It seems there is no immediate action to
> take, but it would be worthwhile to consider this further in the future.
I guess I'm more optimistic on the potential gaps. I think if there were a
spec that allowed one to code it from scratch, I'd be willing to take a crack
at seeing what it would take for another implementation in either Java or C++.
(I looked at the links you provided, but they were somewhat too high-level.)
I think having a spec would also guard against the "newness" concern. I can't
say there wouldn't be other technical blockers, but at least this would be
someplace to start?

Cheers,
Micah

On Thu, Jan 11, 2024 at 7:21 PM Martin Loncaric <m.w.lonca...@gmail.com> wrote:

> (Oops, the repeating binary decimal is 1100... with period 4, so exactly 2
> bits of entropy for the 52 mantissa bits. The argument is the same though.)
>
> On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric <m.w.lonca...@gmail.com>
> wrote:
>
> > To reach a conclusion on this thread, I understand the overall sentiment
> > as:
> >
> > Pco could technically work as a Parquet encoding, but people are wary of
> > its newness and weak FFI support. It seems there is no immediate action to
> > take, but it would be worthwhile to consider this further in the future.
> >
> > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <m.w.lonca...@gmail.com>
> > wrote:
> >
> > > > I must admit I'm a bit surprised by these results. The first thing is
> > > > that the Pcodec results were actually obtained using dictionary
> > > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > > dictionary values or the dictionary indices?
> > >
> > > No, pco cannot be dictionary encoded; it only goes from Vec<T> -> Bytes
> > > and back. Some of Parquet's existing encodings are like this as well.
> > >
> > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > > > worse than the PLAIN + Zstd results, which is unexpected (though not
> > > > impossible).
> > >
> > > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
> > > for this data because there is high correlation among each number's
> > > bytes. For instance, if each double is a multiple of 0.1, then the 52
> > > mantissa bits will look like 011011011011011... (011 repeating). That
> > > means there are only 3 possibilities (<2 bits of entropy) for the last
> > > 6+ bytes of each number. BYTE_STREAM_SPLIT throws this away, requiring
> > > 6+ times as many bits for them.
> > >
> > > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > > Hello Martin,
> > > >
> > > > On Sat, 6 Jan 2024 17:09:07 -0500
> > > > Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > > > > > It would be very interesting to expand the comparison against
> > > > > > BYTE_STREAM_SPLIT + compression.
> > > > >
> > > > > Antoine: I created one now, at the bottom of the post
> > > > > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In
> > > > > this case, BYTE_STREAM_SPLIT did worse.
> > > >
> > > > I must admit I'm a bit surprised by these results. The first thing is
> > > > that the Pcodec results were actually obtained using dictionary
> > > > encoding. Then I don't understand what is Pcodec-encoded: the
> > > > dictionary values or the dictionary indices?
> > > >
> > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > > > worse than the PLAIN + Zstd results, which is unexpected (though not
> > > > impossible).
> > > >
> > > > Regards
> > > >
> > > > Antoine.
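
To make Martin's "Vec<T> -> Bytes and back" point concrete, here is a minimal
sketch of what calling pco's standalone interface looks like in Rust. The names
(pco::standalone::simpler_compress, simple_decompress, DEFAULT_COMPRESSION_LEVEL)
are taken from the pco README around this time and may have drifted in later
releases, so treat this as an illustration of the interface shape rather than
an authoritative example.

    // Cargo.toml dependency assumed: pco (check crates.io for the current version/API).
    use pco::standalone::{simple_decompress, simpler_compress};
    use pco::DEFAULT_COMPRESSION_LEVEL;

    fn main() {
        // A column of doubles, e.g. multiples of 0.1.
        let nums: Vec<f64> = (0..1_000_000).map(|i| i as f64 * 0.1).collect();

        // The whole interface is flat: a slice of numbers in, opaque bytes out.
        // There is no dictionary, no indices, and no levels at this layer.
        let compressed: Vec<u8> =
            simpler_compress(&nums, DEFAULT_COMPRESSION_LEVEL).expect("compression failed");

        // ...and bytes in, numbers out.
        let recovered: Vec<f64> = simple_decompress(&compressed).expect("decompression failed");
        assert_eq!(nums, recovered);

        println!("{} doubles -> {} bytes", nums.len(), compressed.len());
    }

Even if the exact names have changed, the shape of the call is the relevant
part: one function from a slice of numbers to bytes and one back, with nothing
in between for a dictionary to attach to.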
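
Martin's explanation of why BYTE_STREAM_SPLIT struggles on this data can be
checked with a few lines of plain Rust: print the mantissa bits of some
multiples of 0.1 to see the short repeating tail, then perform the byte
transpose that BYTE_STREAM_SPLIT conceptually applies. The transpose below is a
hand-rolled illustration for this sketch, not Parquet's implementation.

    fn main() {
        // The binary expansion of 1/10 repeats with period 4, so the tail of each
        // 52-bit mantissa of a multiple of 0.1 cycles through only a few patterns.
        let nums: Vec<f64> = (1..=8).map(|i| i as f64 * 0.1).collect();
        for &x in &nums {
            let mantissa = x.to_bits() & ((1u64 << 52) - 1);
            println!("{:>4.1}: mantissa = {:052b}", x, mantissa);
        }

        // BYTE_STREAM_SPLIT is conceptually a byte transpose: the k-th byte of every
        // value goes into stream k. That keeps bytes that are similar across values
        // together, but separates the strongly correlated bytes within each value,
        // which is exactly the structure the repeating mantissa tail has.
        let mut streams: Vec<Vec<u8>> = vec![Vec::new(); 8];
        for x in &nums {
            for (k, b) in x.to_le_bytes().iter().enumerate() {
                streams[k].push(*b);
            }
        }
        for (k, stream) in streams.iter().enumerate() {
            println!("stream {}: {:02x?}", k, stream);
        }
    }

Most of the printed mantissas are dominated by the same four-bit cycle (1100 in
some phase, give or take rounding in the last couple of bits), while exactly
representable values such as 0.5 show an all-zero mantissa. A codec that sees
each whole value can exploit that structure directly; after the byte transpose,
each value's correlated bytes are spread across eight separate streams, so a
general-purpose compressor applied afterwards can only see the correlation at
much longer range.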