> > - CMIW, it seems to have the potential to achieve good compression
> > ratio on integers having a common suffix, e.g. decimal(10,2) values
> > that all have .00

This is an interesting point. It looks like the algorithm already does
difference-in-difference encoding. Martin, do you have a sense of how
much the difference-in-difference step vs the tANS step contributes to
the experimental compression ratio?
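For concreteness, here is a minimal, illustrative sketch of what I mean
by difference-in-difference (delta-of-delta); this is not pcodec's
actual implementation, which uses wrapping arithmetic and several
encoding modes:

    // Illustrative delta-of-delta (difference-in-difference) transform.
    // A smooth sequence (e.g. timestamps at a steady cadence) collapses
    // to tiny second-order residuals, which an entropy coder such as
    // tANS can then pack into very few bits.
    fn delta_of_delta(values: &[i64]) -> Vec<i64> {
        let mut out = Vec::with_capacity(values.len());
        let (mut prev, mut prev_delta) = (0i64, 0i64);
        for (i, &v) in values.iter().enumerate() {
            // Real codecs use wrapping arithmetic to avoid overflow.
            match i {
                0 => out.push(v), // keep the first value as-is
                1 => {
                    prev_delta = v - prev;
                    out.push(prev_delta);
                }
                _ => {
                    let delta = v - prev;
                    out.push(delta - prev_delta); // second-order difference
                    prev_delta = delta;
                }
            }
            prev = v;
        }
        out
    }

    fn main() {
        // Millisecond timestamps at a ~1s cadence with small jitter:
        let ts = [1_700_000_000_000i64, 1_700_000_001_001,
                  1_700_000_002_000, 1_700_000_003_002, 1_700_000_004_001];
        // -> [1700000000000, 1001, -2, 3, -3]
        println!("{:?}", delta_of_delta(&ts));
    }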
Thanks,
Micah

On Sat, Jan 13, 2024 at 9:43 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Martin,
> I agree with Gang's point about tANS. I opened up an issue against the
> pcodec repo enumerating some items that I think could be improved for
> someone trying to implement this from the spec [1]. We can maybe use
> that as a centralized place to discuss any understanding/clarity
> issues? (I hope this feedback is helpful in general.)
>
> Having read in a bit more detail, I had a few follow-up questions:
> 1. I assume the way this would be integrated is that each page using
> the encoding would be self-contained (i.e. contains the header,
> metadata, and the data pages)? (From the branched thread with Xuwei, I
> believe this is intended to be treated as a Parquet encoding and not
> compression.)
> 2. The docs say this becomes effective at ~20k elements. Given that is
> the default number of rows per page in parquet-mr (when page indexes
> are used), is it effective at exactly 20k elements? (In the benchmark
> results, what was the average number of values per page?)
> 3. It looks like there is a patent controversy [2] around ANS
> algorithms. IIUC the granted patent might only affect rANS and not the
> tANS used by pcodec; does that match your understanding? Are you aware
> of any other potential patent issues that could come up related to
> pcodec?
>
> Thanks,
> Micah
>
> [1] https://github.com/mwlon/pcodec/issues/149
> [2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems#Patent_controversy
>
> On Sat, Jan 13, 2024 at 8:23 PM Gang Wu <ust...@gmail.com> wrote:
>
> > Hi Martin,
> >
> > Sorry for chiming in late. I have just read your blog post and the
> > format specs. Below are my two cents:
> >
> > - The PCO spec is a good starting point with a good explanation of
> > the format definition. For people unfamiliar with the background, it
> > would be good to also include the mechanism of the algorithm. For
> > example, how does tANS entropy encoding work?
> > - Maintaining JNI in parquet-mr is a burden. Though we have
> > incorporated zstd-jni, which is provided by the hadoop dependency,
> > that library provides a fallback mechanism to leverage a pure Java
> > implementation via aircompressor. So a pure Java implementation of
> > PCO is always desirable.
> > - CMIW, it seems to have the potential to achieve good compression
> > ratio on integers having a common suffix, e.g. decimal(10,2) values
> > that all have a .00 suffix or timestamp values that all have their
> > milliseconds set to 0 (see the sketch just below this message).
> > - To add a new Parquet format change, the community usually requires
> > two PoC implementations (w/o restriction on library or language)
> > before a formal vote.
> >
> > Best,
> > Gang
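Gang's common-suffix point above amounts to the values sharing a large
common divisor: decimal(10,2) amounts that all end in .00, stored as
integer cents, are all multiples of 100. A minimal sketch of the idea,
illustrative only and not pco's actual algorithm:

    // Values with a common decimal suffix share a large common divisor.
    // Dividing it out shrinks every value, so the downstream bit-packing
    // and entropy coding see much smaller numbers.
    fn gcd(a: u64, b: u64) -> u64 {
        if b == 0 { a } else { gcd(b, a % b) }
    }

    fn main() {
        // decimal(10,2) prices that all end in .00, stored as cents:
        let cents = [1_900u64, 4_500, 125_000, 300, 9_900];
        let d = cents.iter().copied().fold(0, gcd); // common divisor: 100
        let reduced: Vec<u64> = cents.iter().map(|&c| c / d).collect();
        // An encoder can store (d, reduced); the decoder multiplies back.
        println!("divisor = {d}, reduced = {reduced:?}");
        // divisor = 100, reduced = [19, 45, 1250, 3, 99]
    }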
> > On Sun, Jan 14, 2024 at 12:43 AM Martin Loncaric
> > <m.w.lonca...@gmail.com> wrote:
> >
> > > Micah: I've added a format doc now:
> > > https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would
> > > appreciate any feedback or thoughts on it.
> > >
> > > On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield
> > > <emkornfi...@gmail.com> wrote:
> > >
> > > > > Pco could technically work as a Parquet encoding, but people
> > > > > are wary of its newness and weak FFI support. It seems there is
> > > > > no immediate action to take, but would be worthwhile to
> > > > > consider this again further in the future.
> > > >
> > > > I guess I'm more optimistic on the potential gaps. I think if
> > > > there were a spec that allowed one to code it from scratch, I'd
> > > > be willing to take a crack at seeing what it would take for
> > > > another implementation in either Java or C++. (I looked at the
> > > > links you provided but they were somewhat too high-level.) I
> > > > think having a spec would also guard against the "newness"
> > > > concern.
> > > >
> > > > I can't say there wouldn't be other technical blockers but at
> > > > least this would be someplace to start?
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > > On Thu, Jan 11, 2024 at 7:21 PM Martin Loncaric
> > > > <m.w.lonca...@gmail.com> wrote:
> > > >
> > > > > (Oops, the repeating binary decimal is 1100... with period 4,
> > > > > so exactly 2 bits of entropy for the 52 mantissa bits. The
> > > > > argument is the same though.)
> > > > >
> > > > > On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric
> > > > > <m.w.lonca...@gmail.com> wrote:
> > > > >
> > > > > > To reach a conclusion on this thread, I understand the
> > > > > > overall sentiment as:
> > > > > >
> > > > > > Pco could technically work as a Parquet encoding, but people
> > > > > > are wary of its newness and weak FFI support. It seems there
> > > > > > is no immediate action to take, but would be worthwhile to
> > > > > > consider this again further in the future.
> > > > > >
> > > > > > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric
> > > > > > <m.w.lonca...@gmail.com> wrote:
> > > > > >
> > > > > > > > I must admit I'm a bit surprised by these results. The
> > > > > > > > first thing is that the Pcodec results were actually
> > > > > > > > obtained using dictionary encoding. Then I don't
> > > > > > > > understand what is Pcodec-encoded: the dictionary values
> > > > > > > > or the dictionary indices?
> > > > > > >
> > > > > > > No, pco cannot be dictionary encoded; it only goes from
> > > > > > > Vec<T> -> bytes and back (see the round-trip sketch at the
> > > > > > > end of this thread). Some of Parquet's existing encodings
> > > > > > > are like this as well.
> > > > > > >
> > > > > > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd
> > > > > > > > results are much worse than the PLAIN + Zstd results,
> > > > > > > > which is unexpected (though not impossible).
> > > > > > >
> > > > > > > I explained briefly in the blog post, but BYTE_STREAM_SPLIT
> > > > > > > does terribly for this data because there is high
> > > > > > > correlation among each number's bytes. For instance, if
> > > > > > > each double is a multiple of 0.1, then the 52 mantissa bits
> > > > > > > will look like 011011011011011... (011 repeating). That
> > > > > > > means there are only 3 possibilities (<2 bits of entropy)
> > > > > > > for the last 6+ bytes of each number. BYTE_STREAM_SPLIT
> > > > > > > throws this away, requiring 6+ times as many bits for them.
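Martin's repeating-mantissa claim, with his period-4 correction from the
follow-up above, is easy to check. A small sketch that prints the low
mantissa bits of multiples of 0.1:

    // Print the 52 mantissa bits of doubles that are multiples of 0.1.
    // The bits cycle with a short period (up to rounding in the last
    // bit), so the low bytes take only a handful of distinct values.
    // BYTE_STREAM_SPLIT scatters those bytes into separate streams and
    // loses this structure.
    fn main() {
        for i in 1u64..=5 {
            let x = i as f64 * 0.1;
            let mantissa = x.to_bits() & ((1u64 << 52) - 1); // low 52 bits
            println!("{x:.1}: {mantissa:052b}");
        }
    }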
>> > > > >>> > >> > > > >>> > Antoine: I created one now, at the bottom of the post >> > > > >>> > < >> https://graphallthethings.com/posts/the-parquet-we-could-have>. >> > > In >> > > > >>> this >> > > > >>> > case, BYTE_STREAM_SPLIT did worse. >> > > > >>> >> > > > >>> I must admit I'm a bit surprised by these results. The first >> thing >> > is >> > > > >>> that the Pcodec results were actually obtained using dictionary >> > > > >>> encoding. Then I don't understand what is Pcodec-encoded: the >> > > > dictionary >> > > > >>> values or the dictionary indices? >> > > > >>> >> > > > >>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results >> are >> > > much >> > > > >>> worse than the PLAIN + Zstd results, which is unexpected (though >> > not >> > > > >>> impossible). >> > > > >>> >> > > > >>> Regards >> > > > >>> >> > > > >>> Antoine. >> > > > >>> >> > > > >>> >> > > > >>> >> > > > >> > > >> > >> >