Gang: thanks for all the good info. I've iterated a bit on the spec document now. To answer your question,
> ... it seems to have the potential to achieve good compression ratio on
> integers having common suffix...

Absolutely. And it's somewhat robust when a small fraction of numbers break the pattern.

Gang and Micah: I don't intend to explain tANS from first principles in the spec document; I'm just linking to the original paper, which defines it precisely. Zstd also uses it (under the name Finite State Entropy).

Micah:

1. I think there are 2 possible ways to go about it: either we make the Parquet file, column chunk, and data page map 1:1 with the Pcodec file, chunk, and page; or we put a whole Pcodec chunk in each data page. I think the first choice is more ideal, but either one has implications. We should probably spin this off into a separate thread when we get serious about tackling this one.
2. (Jointly with the 1st bullet) I think it's better if we have larger Pcodec chunks that match up with column chunks, but empirically Pcodec still seems somewhat valuable even if we can only compress each data page independently.
3. Yes, my understanding is that only rANS is possibly affected, so tANS is perfectly fine. I'm not sure whether Microsoft ever intends to act on their patent, but it's an extremely dubious one, considering that Jarek Duda published the technique before they were ever in the picture.

> It looks like the algorithm already does difference-in-difference encoding.
> Martin, do you have a sense of how much the difference-in-difference vs the
> tANS step contributes to the experimental compression ratio?

It depends on the data, but I do have a sense of it. Two clarifications, just in case:

- Pcodec automatically detects the best delta encoding order (rough sketch below). In the vast majority of cases, 0th (nothing) or 1st (delta) order is best, and that was true for all the benchmarks I showed (even the air quality data, which is a time series). 2nd order (delta-of-delta) and up only help when the data is extremely smooth, so they don't contribute much to Pcodec. Some time series DBs like Gorilla choose delta-of-delta on timestamps a priori, but that's because their encodings are statically best for near-0 data, whereas post-delta-encoding Pcodec can handle any stationary distribution well (all 777's is just as good as all 0's).
- Switching from Huffman codes to tANS gave <1% improvement in compression ratio on most of my benchmarks, so tANS itself isn't a silver bullet (though I don't think this is what you were asking).
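To make the delta-order point a bit more concrete, here's a rough Rust sketch of "try each order and keep whichever looks cheapest". To be clear, this is not pco's actual code or its real selection heuristic, and the bit-count cost function is just a stand-in; it's only meant to illustrate the idea:

// Illustration only -- not pcodec's real API or heuristic. The idea: apply
// delta encoding 0, 1, 2, ... times and keep whichever order makes the
// resulting values cheapest by some cost proxy.

/// One round of delta encoding: keep the first value, replace the rest with
/// consecutive differences (wrapping subtraction to avoid overflow).
fn delta_once(xs: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(xs.len());
    if let Some(&first) = xs.first() {
        out.push(first);
        for w in xs.windows(2) {
            out.push(w[1].wrapping_sub(w[0]));
        }
    }
    out
}

/// nth-order delta encoding is just delta_once applied n times.
fn delta_n(xs: &[i64], order: usize) -> Vec<i64> {
    (0..order).fold(xs.to_vec(), |acc, _| delta_once(&acc))
}

/// Crude cost proxy: total bits needed to cover each value's magnitude.
/// A real codec would estimate entropy; this is just enough to compare orders.
fn cost_bits(xs: &[i64]) -> u64 {
    xs.iter()
        .map(|&x| u64::from(64 - x.unsigned_abs().leading_zeros()))
        .sum()
}

fn main() {
    // A slowly increasing series with a small jittery wobble, so it isn't
    // perfectly smooth: 1st-order deltas are small, and higher orders
    // mostly just amplify the wobble.
    let series: Vec<i64> = (0..10_000)
        .map(|i| 1_000_000 + 3 * i + (7 * i) % 16)
        .collect();

    let mut best = (0, u64::MAX);
    for order in 0..=3 {
        let bits = cost_bits(&delta_n(&series, order));
        println!("order {order}: ~{bits} bits");
        if bits < best.1 {
            best = (order, bits);
        }
    }
    println!("chosen delta order: {}", best.0);
}

On a series like that, 1st order should come out far cheaper than 0th, with 2nd and 3rd slightly worse. pco's real detection is more sophisticated than this, but the shape of the idea (compute candidate delta orders, compare how cheap the results look) is the same.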
The big compression ratio wins come from:

- Not losing information about byte values. LZ7* techniques treat bytes as tokens, ignoring the ordering of numbers. Pcodec treats numbers as numbers, and this explains something like +30% compression ratio vs LZ7* approaches on most interesting distributions. Lightweight encodings don't change this fact.
- Holistically combining "modes" (like the 0.01x multiplier stuff), delta encoding, and binning, and automatically detecting the best way to do each one. On some datasets these don't matter, but when they do, they can explain much more of the compression ratio difference. The air quality dataset benefits from Pcodec's improved interplay between delta encoding and binning, and the taxi dataset benefits from FloatMult mode since it contains lots of decimal prices.

Antoine:

1. Certainly. Though as I've said, Pcodec does the work that both encoding and compression would normally do together, and Pcodec is simpler than some of Parquet's compression libraries.
2. I'm definitely the main maintainer, but there are some small contributors. Two people are helping me with FFI for Python and C now.
3. Pcodec has a couple of avid users, but no big ones. Its predecessor actually gets more downloads <https://crates.io/crates/q_compress>. I should probably proselytize Pcodec a bit to get the word out.

On Mon, Jan 15, 2024 at 12:13 PM Antoine Pitrou <anto...@python.org> wrote:
>
> My personal sentiment is: not only its newness, but the fact that it is
>
> 1) highly non-trivial (it seems much more complicated than all other
> Parquet encodings);
> 2) maintained by a single person, both the spec and the implementation
> (please correct me if I'm wrong?); and
> 3) has little to no adoption currently (again, please correct me if
> I'm wrong?).
>
> Of course the adoption issue is a chicken-and-egg problem, but given
> that Parquet files are used for long-term storage (not just transient
> data), it's probably not a good idea to be an early adopter here.
>
> And of course, if the encoding was simpler, points 2 and 3 wouldn't
> really hurt.
>
> This is just my opinion!
>
> Regards
>
> Antoine.
>
>
> On Thu, 11 Jan 2024 22:02:02 -0500
> Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > To reach a conclusion on this thread, I understand the overall sentiment as:
> >
> > Pco could technically work as a Parquet encoding, but people are wary of
> > its newness and weak FFI support. It seems there is no immediate action
> > to take, but would be worthwhile to consider this again further in the
> > future.
> >
> > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <m.w.lonca...@gmail.com>
> > wrote:
> >
> > >> I must admit I'm a bit surprised by these results. The first thing is
> > >> that the Pcodec results were actually obtained using dictionary
> > >> encoding. Then I don't understand what is Pcodec-encoded: the
> > >> dictionary values or the dictionary indices?
> > >
> > > No, pco cannot be dictionary encoded; it only goes from vec<T> -> Bytes
> > > and back. Some of Parquet's existing encodings are like this as well.
> > >
> > >> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > >> worse than the PLAIN + Zstd results, which is unexpected (though not
> > >> impossible).
> > >
> > > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
> > > for this data because there is high correlation among each number's
> > > bytes. For instance, if each double is a multiple of 0.1, then the 52
> > > mantissa bits will look like 011011011011011... (011 repeating). That
> > > means there are only 3 possibilities (<2 bits of entropy) for the last
> > > 6+ bytes of each number. BYTE_STREAM_SPLIT throws this away, requiring
> > > 6+ times as many bits for them.
> > >
> > > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou
> > > <antoine-+zn9apsxkcfqfi55v6+...@public.gmane.org> wrote:
> > >
> > >> Hello Martin,
> > >>
> > >> On Sat, 6 Jan 2024 17:09:07 -0500
> > >> Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > >> > > It would be very interesting to expand the comparison against
> > >> > > BYTE_STREAM_SPLIT + compression.
> > >> >
> > >> > Antoine: I created one now, at the bottom of the post
> > >> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In
> > >> > this case, BYTE_STREAM_SPLIT did worse.
> > >>
> > >> I must admit I'm a bit surprised by these results. The first thing is
> > >> that the Pcodec results were actually obtained using dictionary
> > >> encoding.
> > >> Then I don't understand what is Pcodec-encoded: the dictionary
> > >> values or the dictionary indices?
> > >>
> > >> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > >> worse than the PLAIN + Zstd results, which is unexpected (though not
> > >> impossible).
> > >>
> > >> Regards
> > >>
> > >> Antoine.