Hello! I am testing Zstd with dictionary, and it looks very very promising. I'm confident I can choose settings where it is faster than my own algo while bringing better compression ratio, on "cod" dataset.
So I am happily retiring my code and switching to Zstd. This would probably mean that we will ship the compression implementation as a separate module.

It is a pity that I did not find out about Zstd dictionary support earlier; that would have saved me a few days of work. Without a dictionary, Zstd's results were worse than my own algo's, but it was faster.

Regards,
--
Ilya Kasnacheev

Mon, Aug 27, 2018 at 14:53, Vyacheslav Daradur <daradu...@gmail.com>:

> According to my benchmarks, the zstd compression algorithm [1] looks very
> interesting: it has a high compression ratio with quite good speed.
> AFAIK it supports external dictionaries, but I'm not sure about using
> it with dictionaries built on the fly. Anyway, have a look at it (it
> has an ASF 2.0 friendly license).
>
> Also, here is a data generator / loader [2]. If it would be useful for
> you, we should ask Nikolay Izhikov to share public docs to start.
>
> [1] https://github.com/facebook/zstd
> [2] https://github.com/nizhikov/ignite-cod-data-loader
> On Mon, Aug 27, 2018 at 2:11 PM Ilya Kasnacheev
> <ilya.kasnach...@gmail.com> wrote:
> >
> > Hello Vyacheslav!
> >
> > Unfortunately I have not found any efficient algorithms that would allow me
> > to use an external dictionary as a pre-processed data structure. If plain gzip
> > is used without a dictionary, the compression ratio is around 0.7, as opposed to
> > the 0.4 that I get with the custom implementation; AFAIR the performance was
> > also worse. I didn't really try it with a dictionary, but I assume
> > performance will be even worse, since it will have to scan the dictionary before
> > getting to the actual data.
> >
> > We have such a huge array of tests that we can just run them all with
> > compression enabled and see if there are any new failures. But the impact of
> > my commit is fairly low: it is only triggered when data is written to a page
> > (maybe to WAL also?), and we don't really do much frivolous stuff to pages.
> >
> > Still, I am very much interested in finding existing compression
> > implementations that support an external dictionary; I am also very much
> > interested in having different implementations of compression for Apache
> > Ignite (such as per-page compression) and comparing them by benchmark and
> > by code impact. I am also very interested in large standard datasets for
> > Apache Ignite (or generators thereof) so that we can run precise benchmarks
> > on various compression schemes. If you have any of the above, please
> > get back to me.
> >
> > Regards,
> > --
> > Ilya Kasnacheev
> >
> >
> > Mon, Aug 27, 2018 at 11:35, Vyacheslav Daradur <daradu...@gmail.com>:
> >
> > > Hi Igniters!
> > >
> > > Ilya, I'm glad to see one more person who is interested in the
> > > compression feature in Ignite.
> > >
> > > I looked through the pull request and want to share the following thoughts:
> > >
> > > It's very dangerous to use a custom algorithm in this way - you store
> > > serialized data separately from the dictionary, and there are a lot of
> > > points at which we may lose data: rebalancing, serialization errors, node
> > > rebooting and so on.
> > >
> > > I'd suggest the following ways to improve reliability:
> > > - use well-known algorithms (e.g. zstd, deflater, lzma, gzip) that
> > > allow us to decompress data in any situation
> > > - store the dictionary inside the page with the data
> > >
> > > Also, we have had a lot of discussions [1] [2] about compression at the
> > > BinaryObject and BinaryMarshaller level, and Vladimir Ozerov was
> > > strictly against compression at this level.
> > > If something has changed since then, you may look through [1] [2] [3].
> > > I've done a lot of research on algorithm comparison; it may be useful
> > > for you.
> > >
> > > [1] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-2-0-td10099.html
> > > [2] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-td20679.html
> > > [3] https://issues.apache.org/jira/browse/IGNITE-3592
> > > [4] https://issues.apache.org/jira/browse/IGNITE-5226
> > > [5] https://github.com/daradurvs/ignite-compression
> > > On Sat, Aug 25, 2018 at 2:51 AM Denis Magda <dma...@apache.org> wrote:
> > > >
> > > > > Currently, the dictionary for decompression is only stored on heap. After
> > > > > restart there's compressed data in the PDS, but there's no dictionary :)
> > > >
> > > > Basically, it means that I've lost my data, right? How about persisting
> > > > data to disk?
> > > >
> > > > Overall, we need Vladimir Ozerov to check the contribution. He was the one
> > > > who sponsored the IEP and knows the area best.
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Fri, Aug 24, 2018 at 4:31 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello!
> > > > >
> > > > > It is somewhat a part of IEP-20, since I have updated it with this
> > > > > particular direction.
> > > > >
> > > > > Regards,
> > > > >
> > > > > --
> > > > > Ilya Kasnacheev
> > > > >
> > > > > 2018-08-24 2:56 GMT+03:00 Denis Magda <dma...@apache.org>:
> > > > >
> > > > > > Hi Ilya,
> > > > > >
> > > > > > Sounds terrific! Is this part of the following Ignite enhancement proposal?
> > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
> > > > > >
> > > > > > --
> > > > > > Denis
> > > > > >
> > > > > > On Thu, Aug 23, 2018 at 5:17 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > My plan was to add a compression section to the cache configuration, where
> > > > > > > you can enable compression, enable key compression (which has heavier
> > > > > > > performance implications), adjust dictionary gathering settings, and in
> > > > > > > the future possibly choose between algorithms. In fact I'm not sure, since
> > > > > > > my assumption is that you can always just use the latest & greatest, but
> > > > > > > maybe we can have e.g. a very fast but not very strong one vs. a slower
> > > > > > > but stronger one.
> > > > > > >
> > > > > > > I'm not sure yet whether we should share one dictionary between all caches
> > > > > > > or have a separate dictionary for every cache.
> > > > > > >
> > > > > > > With regards to the data format, of course there will be room for further
> > > > > > > extension.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > --
> > > > > > > Ilya Kasnacheev
> > > > > > >
> > > > > > > 2018-08-23 15:13 GMT+03:00 Sergey Kozlov <skoz...@gridgain.com>:
> > > > > > >
> > > > > > > > Hi Ilya
> > > > > > > >
> > > > > > > > Is there a plan to introduce it as an option of the Ignite configuration?
> > > > > > > > In that case, instead of a boolean type I suggest using an enum, to
> > > > > > > > reserve the ability to add compression algorithms in the future.
> > > > > > > >
> > > > > > > > On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hello!
> > > > > > > > >
> > > > > > > > > I want to share with the developer community my compression prototype.
> > > > > > > > >
> > > > > > > > > Long story short, it compresses BinaryObject's byte[] as they are written
> > > > > > > > > to a Durable Memory page, operating on a pre-built dictionary.
> > > > > > > > > Typical compression ratio is 0.4 (meaning 2.5x compression) using a
> > > > > > > > > custom LZW+Huffman. Metadata, indexes and primitive values are
> > > > > > > > > entirely unaffected.
> > > > > > > > >
> > > > > > > > > This is akin to DB2's table-level compression [1], but independently
> > > > > > > > > invented.
> > > > > > > > >
> > > > > > > > > On Yardstick tests the performance hit is 6% with PDS and up to 25%
> > > > > > > > > (in throughput) with In-Memory loads. It also means you can fit ~twice
> > > > > > > > > as much data into the same IM cluster, or have a higher RAM/disk ratio
> > > > > > > > > with a PDS cluster, saving on hardware or decreasing latency.
> > > > > > > > >
> > > > > > > > > The code is available as PR 4295 [2] (set IGNITE_ENABLE_COMPRESSION=true
> > > > > > > > > to activate). Note that it will not presently survive a PDS node
> > > > > > > > > restart. The impact is very small; the patch should be applicable to
> > > > > > > > > most 2.x releases.
> > > > > > > > >
> > > > > > > > > Sure, there's a long way to go before this prototype can have any hope
> > > > > > > > > of being included, but first I would like to hear input from fellow
> > > > > > > > > igniters.
> > > > > > > > >
> > > > > > > > > See also IEP-20 [3].
> > > > > > > > >
> > > > > > > > > 1.
> > > > > > > > > https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
> > > > > > > > > 2. https://github.com/apache/ignite/pull/4295
> > > > > > > > > 3.
> > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Ilya Kasnacheev
> > > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Sergey Kozlov
> > > > > > > > GridGain Systems
> > > > > > > > www.gridgain.com
> > > > > > > >
> > >
> > > --
> > > Best Regards, Vyacheslav D.
> > >
>
> --
> Best Regards, Vyacheslav D.
>
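
As a rough illustration of the external-dictionary idea discussed in this thread (compressing small records against a pre-built dictionary that must be available again at decompression time), here is a minimal Python sketch. It is an assumption-laden stand-in: it uses the standard-library zlib module's preset dictionary (the zdict parameter, i.e. deflate), not Zstd and not the custom LZW+Huffman prototype from PR 4295. The sample dictionary, the sample record and the helper names compress_with_dict / decompress_with_dict are hypothetical.

```python
import zlib

# Hypothetical dictionary: byte sequences expected to recur in the records.
# In the thread's setting it would be gathered from real cache data.
DICTIONARY = b'{"firstName":"","lastName":"","city":"","country":""}'


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Deflate `data` using `dictionary` as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Inflate `blob`; the dictionary used for compression is required."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


if __name__ == "__main__":
    # Hypothetical record that shares its field names with the dictionary.
    record = (b'{"firstName":"Ivan","lastName":"Petrov",'
              b'"city":"Moscow","country":"Russia"}')
    plain = zlib.compress(record, 9)
    with_dict = compress_with_dict(record, DICTIONARY)

    # Round trip succeeds only when the same dictionary is supplied.
    assert decompress_with_dict(with_dict, DICTIONARY) == record
    print("raw:", len(record),
          "deflate:", len(plain),
          "deflate+dict:", len(with_dict))
```

Decompressing the same bytes without the dictionary (zlib.decompressobj() with no zdict) raises zlib.error, which is the same failure mode as the concern raised above about compressed data in the PDS outliving a dictionary that is only kept on heap.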