Gang: thanks for all the good info. I've iterated a bit on the spec document now. To answer your question,
> ... it seems to have the potential to achieve good compression ratio on
> integers having common suffix...

Absolutely. And it's somewhat robust when a small fraction of numbers break the pattern.

Gang and Micah: I don't intend to explain tANS from first principles in the spec document; I'm just linking to the original paper, which defines it precisely. Zstd also uses it (under the name Finite State Entropy).

Micah:

1. I think there are 2 possible ways to go about it: either we make the Parquet file, column chunk, and data page map 1:1 with the Pcodec file, chunk, and page; or we put a whole Pcodec chunk in each data page. I think the first choice is more ideal, but either one has implications. We should probably spin this off into a separate thread when we get serious about tackling this one.
2. (Jointly with the 1st bullet) I think it's better if we have larger Pcodec chunks that match up with column chunks, but empirically Pcodec still seems somewhat valuable even if we can only compress each data page independently.
3. Yes, my understanding is that only rANS is possibly affected, so tANS is perfectly fine. I'm not sure whether Microsoft ever intends to act on their patent, but it's an extremely dubious one, considering that Jarek Duda published the technique before they were ever in the picture.

> It looks like the algorithm already does difference-in-difference encoding.
> Martin, do you have a sense of how much the difference-in-difference vs the
> tANS step contributes to the experimental compression ratio?

It depends on the data, but I do have a sense of it. Two clarifications, just in case:

- Pcodec automatically detects the best delta encoding order (rough sketch below). In the vast majority of cases, 0th (nothing) or 1st (delta) order is best, and that was true for all the benchmarks I showed (even the air quality data, which is a time series). 2nd order (delta-of-delta) and up only help when the data is extremely smooth, so they don't contribute much to Pcodec. Some time series DBs like Gorilla choose delta-of-delta on timestamps a priori, but that's because their encodings are statically best for near-0 data, whereas post-delta-encoding Pcodec can handle any stationary distribution well (all 777's is just as good as all 0's).
- Switching from Huffman codes to tANS gave <1% improvement in compression ratio on most of my benchmarks, so tANS itself isn't a silver bullet (though I don't think this is what you were asking).
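To make the delta-order point a bit more concrete, here's a rough Rust sketch of "try each order and keep whichever looks cheapest". To be clear, this is not pco's actual code or its real selection heuristic, and the bit-count cost function is just a stand-in; it's only meant to illustrate the idea:

// Illustration only -- not pcodec's real API or heuristic. The idea: apply
// delta encoding 0, 1, 2, ... times and keep whichever order makes the
// resulting values cheapest by some cost proxy.

/// One round of delta encoding: keep the first value, replace the rest with
/// consecutive differences (wrapping subtraction to avoid overflow).
fn delta_once(xs: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(xs.len());
    if let Some(&first) = xs.first() {
        out.push(first);
        for w in xs.windows(2) {
            out.push(w[1].wrapping_sub(w[0]));
        }
    }
    out
}

/// nth-order delta encoding is just delta_once applied n times.
fn delta_n(xs: &[i64], order: usize) -> Vec<i64> {
    (0..order).fold(xs.to_vec(), |acc, _| delta_once(&acc))
}

/// Crude cost proxy: total bits needed to cover each value's magnitude.
/// A real codec would estimate entropy; this is just enough to compare orders.
fn cost_bits(xs: &[i64]) -> u64 {
    xs.iter()
        .map(|&x| u64::from(64 - x.unsigned_abs().leading_zeros()))
        .sum()
}

fn main() {
    // A slowly increasing series with a small jittery wobble, so it isn't
    // perfectly smooth: 1st-order deltas are small, and higher orders
    // mostly just amplify the wobble.
    let series: Vec<i64> = (0..10_000)
        .map(|i| 1_000_000 + 3 * i + (7 * i) % 16)
        .collect();

    let mut best = (0, u64::MAX);
    for order in 0..=3 {
        let bits = cost_bits(&delta_n(&series, order));
        println!("order {order}: ~{bits} bits");
        if bits < best.1 {
            best = (order, bits);
        }
    }
    println!("chosen delta order: {}", best.0);
}

On a series like that, 1st order should come out far cheaper than 0th, with 2nd and 3rd slightly worse. pco's real detection is more sophisticated than this, but the shape of the idea (compute candidate delta orders, compare how cheap the results look) is the same.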
The big compression ratio wins come from:

- Not losing information about byte values. LZ7* techniques treat bytes as tokens, ignoring the ordering of numbers. Pcodec treats numbers as numbers, and this explains something like +30% compression ratio vs LZ7* approaches on most interesting distributions. Lightweight encodings don't change this fact.
- Holistically combining "modes" (like the 0.01x multiplier stuff), delta encoding, and binning, and automatically detecting the best way to do each one. On some datasets these don't matter, but when they do, they can explain much more of the compression ratio difference. The air quality dataset benefits from Pcodec's improved interplay between delta encoding and binning, and the taxi dataset benefits from FloatMult mode since it contains lots of decimal prices.

Antoine:

1. Certainly. Though as I've said, Pcodec does the work that both encoding and compression would normally do together, and Pcodec is simpler than some of Parquet's compression libraries.
2. I'm definitely the main maintainer, but there are some small contributors. Two people are helping me with FFI for Python and C now.
3. Pcodec has a couple of avid users, but no big ones. Its predecessor actually gets more downloads <https://crates.io/crates/q_compress>. I should probably proselytize Pcodec a bit to get the word out.

On Mon, Jan 15, 2024 at 12:13 PM Antoine Pitrou <anto...@python.org> wrote:
>
> My personal sentiment is: not only its newness, but the fact that it is
>
> 1) highly non-trivial (it seems much more complicated than all other
> Parquet encodings);
> 2) maintained by a single person, both the spec and the implementation
> (please correct me if I'm wrong?); and
> 3) has little to no adoption currently (again, please correct me if
> I'm wrong?).
>
> Of course the adoption issue is a chicken-and-egg problem, but given
> that Parquet files are used for long-term storage (not just transient
> data), it's probably not a good idea to be an early adopter here.
>
> And of course, if the encoding was simpler, points 2 and 3 wouldn't
> really hurt.
>
> This is just my opinion!
>
> Regards
>
> Antoine.
>
>
> On Thu, 11 Jan 2024 22:02:02 -0500
> Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > To reach a conclusion on this thread, I understand the overall sentiment as:
> >
> > Pco could technically work as a Parquet encoding, but people are wary of
> > its newness and weak FFI support. It seems there is no immediate action
> > to take, but would be worthwhile to consider this again further in the
> > future.
> >
> > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric <m.w.lonca...@gmail.com>
> > wrote:
> >
> > >> I must admit I'm a bit surprised by these results. The first thing is
> > >> that the Pcodec results were actually obtained using dictionary
> > >> encoding. Then I don't understand what is Pcodec-encoded: the
> > >> dictionary values or the dictionary indices?
> > >
> > > No, pco cannot be dictionary encoded; it only goes from vec<T> -> Bytes
> > > and back. Some of Parquet's existing encodings are like this as well.
> > >
> > >> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > >> worse than the PLAIN + Zstd results, which is unexpected (though not
> > >> impossible).
> > >
> > > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
> > > for this data because there is high correlation among each number's
> > > bytes. For instance, if each double is a multiple of 0.1, then the 52
> > > mantissa bits will look like 011011011011011... (011 repeating). That
> > > means there are only 3 possibilities (<2 bits of entropy) for the last
> > > 6+ bytes of each number. BYTE_STREAM_SPLIT throws this away, requiring
> > > 6+ times as many bits for them.
> > >
> > > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou
> > > <antoine-+zn9apsxkcfqfi55v6+...@public.gmane.org> wrote:
> > >
> > >> Hello Martin,
> > >>
> > >> On Sat, 6 Jan 2024 17:09:07 -0500
> > >> Martin Loncaric <m.w.lonca...@gmail.com> wrote:
> > >> > > It would be very interesting to expand the comparison against
> > >> > > BYTE_STREAM_SPLIT + compression.
> > >> >
> > >> > Antoine: I created one now, at the bottom of the post
> > >> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In
> > >> > this case, BYTE_STREAM_SPLIT did worse.
> > >>
> > >> I must admit I'm a bit surprised by these results. The first thing is
> > >> that the Pcodec results were actually obtained using dictionary
> > >> encoding.
> > >> Then I don't understand what is Pcodec-encoded: the dictionary
> > >> values or the dictionary indices?
> > >>
> > >> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> > >> worse than the PLAIN + Zstd results, which is unexpected (though not
> > >> impossible).
> > >>
> > >> Regards
> > >>
> > >> Antoine.