> > - CMIW, it seems to have the potential to achieve good compression
> > ratio on integers having a common suffix, e.g. decimal(10,2) values
> > that all have .00

This is an interesting point. It looks like the algorithm already does
difference-in-difference encoding. Martin, do you have a sense of how
much the difference-in-difference step vs the tANS step contributes to
the experimental compression ratio?
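For concreteness, here is a minimal, illustrative sketch of what I mean
by difference-in-difference (delta-of-delta); this is not pcodec's
actual implementation, which uses wrapping arithmetic and several
encoding modes:

    // Illustrative delta-of-delta (difference-in-difference) transform.
    // A smooth sequence (e.g. timestamps at a steady cadence) collapses
    // to tiny second-order residuals, which an entropy coder such as
    // tANS can then pack into very few bits.
    fn delta_of_delta(values: &[i64]) -> Vec<i64> {
        let mut out = Vec::with_capacity(values.len());
        let (mut prev, mut prev_delta) = (0i64, 0i64);
        for (i, &v) in values.iter().enumerate() {
            // Real codecs use wrapping arithmetic to avoid overflow.
            match i {
                0 => out.push(v), // keep the first value as-is
                1 => {
                    prev_delta = v - prev;
                    out.push(prev_delta);
                }
                _ => {
                    let delta = v - prev;
                    out.push(delta - prev_delta); // second-order difference
                    prev_delta = delta;
                }
            }
            prev = v;
        }
        out
    }

    fn main() {
        // Millisecond timestamps at a ~1s cadence with small jitter:
        let ts = [1_700_000_000_000i64, 1_700_000_001_001,
                  1_700_000_002_000, 1_700_000_003_002, 1_700_000_004_001];
        // -> [1700000000000, 1001, -2, 3, -3]
        println!("{:?}", delta_of_delta(&ts));
    }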
Thanks,
Micah

On Sat, Jan 13, 2024 at 9:43 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Martin,
> I agree with Gang's point about tANS. I opened up an issue against the
> pcodec repo enumerating some items that I think could be improved for
> someone trying to implement this from the spec [1]. We can maybe use
> that as a centralized place to discuss any understanding/clarity
> issues? (I hope this feedback is helpful in general.)
>
> Having read in a bit more detail, I had a few follow-up questions:
> 1. I assume the way this would be integrated is that each page using
> the encoding would be self-contained (i.e. contains the header,
> metadata, and the data pages)? (From the branched thread with Xuwei, I
> believe this is intended to be treated as a Parquet encoding and not
> compression.)
> 2. The docs say this becomes effective at ~20k elements. Given that is
> the default number of rows per page in parquet-mr (when page indexes
> are used), is it effective at exactly 20k elements? (In the benchmark
> results, what was the average number of values per page?)
> 3. It looks like there is a patent controversy [2] around ANS
> algorithms. IIUC the granted patent might only affect rANS and not the
> tANS used by pcodec; does that match your understanding? Are you aware
> of any other potential patent issues that could come up related to
> pcodec?
>
> Thanks,
> Micah
>
> [1] https://github.com/mwlon/pcodec/issues/149
> [2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems#Patent_controversy
>
> On Sat, Jan 13, 2024 at 8:23 PM Gang Wu <ust...@gmail.com> wrote:
>
> > Hi Martin,
> >
> > Sorry for chiming in late. I have just read your blog post and the
> > format specs. Below are my two cents:
> >
> > - The PCO spec is a good starting point with a good explanation of
> > the format definition. For people unfamiliar with the background, it
> > would be good to also include the mechanism of the algorithm. For
> > example, how does tANS entropy encoding work?
> > - Maintaining JNI in parquet-mr is a burden. Though we have
> > incorporated zstd-jni, which is provided by the hadoop dependency,
> > that library provides a fallback mechanism to leverage a pure Java
> > implementation via aircompressor. So a pure Java implementation of
> > PCO is always desirable.
> > - CMIW, it seems to have the potential to achieve good compression
> > ratio on integers having a common suffix, e.g. decimal(10,2) values
> > that all have a .00 suffix or timestamp values that all have their
> > milliseconds set to 0 (see the sketch just below this message).
> > - To add a new Parquet format change, the community usually requires
> > two PoC implementations (w/o restriction on library or language)
> > before a formal vote.
> >
> > Best,
> > Gang
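Gang's common-suffix point above amounts to the values sharing a large
common divisor: decimal(10,2) amounts that all end in .00, stored as
integer cents, are all multiples of 100. A minimal sketch of the idea,
illustrative only and not pco's actual algorithm:

    // Values with a common decimal suffix share a large common divisor.
    // Dividing it out shrinks every value, so the downstream bit-packing
    // and entropy coding see much smaller numbers.
    fn gcd(a: u64, b: u64) -> u64 {
        if b == 0 { a } else { gcd(b, a % b) }
    }

    fn main() {
        // decimal(10,2) prices that all end in .00, stored as cents:
        let cents = [1_900u64, 4_500, 125_000, 300, 9_900];
        let d = cents.iter().copied().fold(0, gcd); // common divisor: 100
        let reduced: Vec<u64> = cents.iter().map(|&c| c / d).collect();
        // An encoder can store (d, reduced); the decoder multiplies back.
        println!("divisor = {d}, reduced = {reduced:?}");
        // divisor = 100, reduced = [19, 45, 1250, 3, 99]
    }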
> > On Sun, Jan 14, 2024 at 12:43 AM Martin Loncaric
> > <m.w.lonca...@gmail.com> wrote:
> >
> > > Micah: I've added a format doc now:
> > > https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would
> > > appreciate any feedback or thoughts on it.
> > >
> > > On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield
> > > <emkornfi...@gmail.com> wrote:
> > >
> > > > > Pco could technically work as a Parquet encoding, but people
> > > > > are wary of its newness and weak FFI support. It seems there is
> > > > > no immediate action to take, but would be worthwhile to
> > > > > consider this again further in the future.
> > > >
> > > > I guess I'm more optimistic on the potential gaps. I think if
> > > > there were a spec that allowed one to code it from scratch, I'd
> > > > be willing to take a crack at seeing what it would take for
> > > > another implementation in either Java or C++. (I looked at the
> > > > links you provided but they were somewhat too high-level.) I
> > > > think having a spec would also guard against the "newness"
> > > > concern.
> > > >
> > > > I can't say there wouldn't be other technical blockers but at
> > > > least this would be someplace to start?
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > > On Thu, Jan 11, 2024 at 7:21 PM Martin Loncaric
> > > > <m.w.lonca...@gmail.com> wrote:
> > > >
> > > > > (Oops, the repeating binary decimal is 1100... with period 4,
> > > > > so exactly 2 bits of entropy for the 52 mantissa bits. The
> > > > > argument is the same though.)
> > > > >
> > > > > On Thu, Jan 11, 2024 at 10:02 PM Martin Loncaric
> > > > > <m.w.lonca...@gmail.com> wrote:
> > > > >
> > > > > > To reach a conclusion on this thread, I understand the
> > > > > > overall sentiment as:
> > > > > >
> > > > > > Pco could technically work as a Parquet encoding, but people
> > > > > > are wary of its newness and weak FFI support. It seems there
> > > > > > is no immediate action to take, but would be worthwhile to
> > > > > > consider this again further in the future.
> > > > > >
> > > > > > On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric
> > > > > > <m.w.lonca...@gmail.com> wrote:
> > > > > >
> > > > > > > > I must admit I'm a bit surprised by these results. The
> > > > > > > > first thing is that the Pcodec results were actually
> > > > > > > > obtained using dictionary encoding. Then I don't
> > > > > > > > understand what is Pcodec-encoded: the dictionary values
> > > > > > > > or the dictionary indices?
> > > > > > >
> > > > > > > No, pco cannot be dictionary encoded; it only goes from
> > > > > > > Vec<T> -> bytes and back (see the round-trip sketch at the
> > > > > > > end of this thread). Some of Parquet's existing encodings
> > > > > > > are like this as well.
> > > > > > >
> > > > > > > > The second thing is that the BYTE_STREAM_SPLIT + Zstd
> > > > > > > > results are much worse than the PLAIN + Zstd results,
> > > > > > > > which is unexpected (though not impossible).
> > > > > > >
> > > > > > > I explained briefly in the blog post, but BYTE_STREAM_SPLIT
> > > > > > > does terribly for this data because there is high
> > > > > > > correlation among each number's bytes. For instance, if
> > > > > > > each double is a multiple of 0.1, then the 52 mantissa bits
> > > > > > > will look like 011011011011011... (011 repeating). That
> > > > > > > means there are only 3 possibilities (<2 bits of entropy)
> > > > > > > for the last 6+ bytes of each number. BYTE_STREAM_SPLIT
> > > > > > > throws this away, requiring 6+ times as many bits for them.
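Martin's repeating-mantissa claim, with his period-4 correction from the
follow-up above, is easy to check. A small sketch that prints the low
mantissa bits of multiples of 0.1:

    // Print the 52 mantissa bits of doubles that are multiples of 0.1.
    // The bits cycle with a short period (up to rounding in the last
    // bit), so the low bytes take only a handful of distinct values.
    // BYTE_STREAM_SPLIT scatters those bytes into separate streams and
    // loses this structure.
    fn main() {
        for i in 1u64..=5 {
            let x = i as f64 * 0.1;
            let mantissa = x.to_bits() & ((1u64 << 52) - 1); // low 52 bits
            println!("{x:.1}: {mantissa:052b}");
        }
    }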
>> > > > >>> > >> > > > >>> > Antoine: I created one now, at the bottom of the post >> > > > >>> > < >> https://graphallthethings.com/posts/the-parquet-we-could-have>. >> > > In >> > > > >>> this >> > > > >>> > case, BYTE_STREAM_SPLIT did worse. >> > > > >>> >> > > > >>> I must admit I'm a bit surprised by these results. The first >> thing >> > is >> > > > >>> that the Pcodec results were actually obtained using dictionary >> > > > >>> encoding. Then I don't understand what is Pcodec-encoded: the >> > > > dictionary >> > > > >>> values or the dictionary indices? >> > > > >>> >> > > > >>> The second thing is that the BYTE_STREAM_SPLIT + Zstd results >> are >> > > much >> > > > >>> worse than the PLAIN + Zstd results, which is unexpected (though >> > not >> > > > >>> impossible). >> > > > >>> >> > > > >>> Regards >> > > > >>> >> > > > >>> Antoine. >> > > > >>> >> > > > >>> >> > > > >>> >> > > > >> > > >> > >> >