In my view, a dictionary of 1024 bytes is not going to be nearly enough.

On Tue, Sep 4, 2018 at 8:06 AM, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
> Hello!
>
> In case of Apache Ignite, most of the savings are due to the BinaryObject
> format, which encodes types and fields with byte sequences. Any enum/string
> flags will also get into the dictionary. And then, as it processes a record,
> it fills up its individual dictionary.
>
> But in one cache most, if not all, entries have an identical BinaryObject
> layout, so a tiny dictionary covers that case. Compression algorithms are
> not very keen on large dictionaries, preferring to work with local
> regularities in the byte stream.
>
> E.g. if we have large entries in a cache with low BinaryObject overhead,
> they're served just fine by "generic" compression.
>
> All of the above is my speculation, actually. I just observe that on a
> large data set the compression ratio is around 0.4 (2.5x) with a dictionary
> of 1024 bytes. The rest is a black box.
>
> Regards,
> --
> Ilya Kasnacheev
>
> Tue, Sep 4, 2018 at 17:16, Dmitriy Setrakyan <dsetrak...@apache.org>:
>
> > On Tue, Sep 4, 2018 at 2:55 AM, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> > wrote:
> >
> > > Hello!
> > >
> > > Each node has a local dictionary (per node currently, per cache planned).
> > > The dictionary is never shared between nodes. As data patterns shift,
> > > dictionary rotation is also planned.
> > >
> > > With Zstd, the best dictionary size seems to be 1024 bytes. I imagine it
> > > is enough to store the common BinaryObject boilerplate, and everything
> > > else is compressed on the fly. The source sample is 16k records.
> >
> > Thanks, Ilya, understood. I think per-cache is a better idea. However, I
> > have a question about dictionary size. Ignite stores TBs of data. How do
> > you plan for the dictionary to fit in 1K bytes?
> >
> > D.
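[Editorial note for readers unfamiliar with preset-dictionary compression: the thread describes training a small (~1 KB) dictionary from sample records so the boilerplate shared by every entry in a cache compresses to a few bytes per record. Ignite's implementation uses Zstd; as a stdlib-only sketch of the same idea, Python's zlib also accepts a preset dictionary via `zdict`. The record layout and the hand-crafted dictionary below are hypothetical illustrations, not Ignite's actual BinaryObject format or a trained Zstd dictionary.]

```python
import zlib

# Hypothetical records: like BinaryObject entries in one Ignite cache, they
# all share the same structural boilerplate and differ only in field values.
records = [
    ('{"type":"Person","fields":{"name":"%s","age":%d}}' % (n, a)).encode()
    for n, a in [("alice", 30), ("bob", 25), ("carol", 41)]
]

# A tiny (well under 1024 bytes) preset dictionary holding the shared
# boilerplate. Ignite trains its Zstd dictionary from ~16k sample records;
# here we simply hand-craft one for the demo.
dictionary = b'{"type":"Person","fields":{"name":"","age":}}'

def compress(data: bytes, zdict: bytes = b"") -> bytes:
    # With a preset dictionary, deflate can emit back-references into the
    # dictionary, so the shared boilerplate costs only a few bytes.
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()

def decompress(data: bytes, zdict: bytes = b"") -> bytes:
    # The decompressor must be given the same dictionary.
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(data) + d.flush()

for rec in records:
    plain = compress(rec)
    with_dict = compress(rec, dictionary)
    print(len(rec), len(plain), len(with_dict))
```

On records this small, plain deflate has almost nothing to work with, while the dictionary turns the shared layout into a single back-reference — which is the intuition behind "a tiny dictionary covers that case" above.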