Re: Pitch for Pcodec Encoding in Parquet

Martin Loncaric Thu, 11 Jan 2024 19:02:22 -0800

To reach a conclusion on this thread, I understand the overall sentiment as:


Pco could technically work as a Parquet encoding, but people are wary of
its newness and weak FFI support. It seems there is no immediate action to
take, but it would be worthwhile to revisit this in the future.

On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <m.w.lonca...@gmail.com>
wrote:

>> I must admit I'm a bit surprised by these results. The first thing is
>> that the Pcodec results were actually obtained using dictionary
>> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
>> values or the dictionary indices?
>
>
> No, pco cannot be dictionary encoded; it only goes from vec<T> -> Bytes
> and back. Some of Parquet's existing encodings are like this as well.
>
>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
>> worse than the PLAIN + Zstd results, which is unexpected (though not
>> impossible).
>
>
> I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
> for this data because there is high correlation among each number's bytes.
> For instance, if each double is a multiple of 0.1, then most of the 52
> mantissa bits will look like 00110011... (0011 repeating, in some
> rotation). That means there are only about 4 possibilities (about 2 bits
> of entropy) for the last 6+ bytes of each number. BYTE_STREAM_SPLIT throws
> this away, requiring 6+ times as many bits for them.
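
The byte correlation described above can be checked with a rough sketch.
Everything here is an assumption for illustration: the data set (non-integer
multiples of 0.1 below 200), the helper names, and the use of order-0 Shannon
entropy as a crude stand-in for compressed size. It does not model what Zstd
or pco actually do.

```python
import struct
from collections import Counter
from math import log2


def double_bytes(x: float) -> bytes:
    """Big-endian IEEE 754 bytes: byte 0 starts with the sign and exponent,
    byte 7 holds the lowest mantissa bits."""
    return struct.pack(">d", x)


def byte_stream_split(values):
    """Transpose the values into 8 streams, one per byte position."""
    rows = [double_bytes(v) for v in values]
    return [bytes(row[i] for row in rows) for i in range(8)]


def entropy_bits(symbols) -> float:
    """Order-0 Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


# Hypothetical data: multiples of 0.1 below 200, skipping multiples of 0.5
# (those are exactly representable and have all-zero trailing bytes).
values = [k / 10 for k in range(1, 2000) if k % 5 != 0]

# Show a few raw encodings so the repeated interior bytes are visible.
for v in values[:3]:
    print(v, double_bytes(v).hex())

# Bytes 3..6 of each value, treated together as one symbol per value.
joint = entropy_bits([double_bytes(v)[3:7] for v in values])

# The same four byte positions, each modeled as an independent stream.
streams = byte_stream_split(values)
split = sum(entropy_bits(s) for s in streams[3:7])

print(f"joint: {joint:.2f} bits/value, split: {split:.2f} bits/value")
```

Within each value, the four interior mantissa bytes are identical, so modeling
them jointly costs no more than modeling one of them. Once they are split into
separate streams, each stream pays that cost again, and the split total comes
out at four times the joint figure for these four byte positions.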
>
> On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> Hello Martin,
>>
>> On Sat, 6 Jan 2024 17:09:07 -0500
>> Martin Loncaric <m.w.lonca...@gmail.com>
>> wrote:
>> > >
>> > > It would be very interesting to expand the comparison against
>> > > BYTE_STREAM_SPLIT + compression.
>> >
>> > Antoine: I created one now, at the bottom of the post
>> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In
>> this
>> > case, BYTE_STREAM_SPLIT did worse.
>>
>> I must admit I'm a bit surprised by these results. The first thing is
>> that the Pcodec results were actually obtained using dictionary
>> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
>> values or the dictionary indices?
>>
>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
>> worse than the PLAIN + Zstd results, which is unexpected (though not
>> impossible).
>>
>> Regards
>>
>> Antoine.
>>
>>
>>
