Hello. I updated the KIP page for this issue, following Ismael's guide. https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ ZStandard+Compression
The main difference between the previous version is: - Motivation: Added real-world benchmark. (Thanks to Ivan) - Accompanying Issues: especially the problems related to the old clients & brokers. Please have a look when you are free. Thanks, Dongjin On Sat, Feb 4, 2017 at 11:51 AM, Jeff Widman <j...@netskope.com> wrote: > *My concern is that this minor improvement (based on the benchmark) over > LZ4 does not warrant the work of adding support for a new compression codec > to the broker, official clients and horde of 3rd party clients, including > upgrade paths, transformations, tests, additional dependencies, etc.* > > > As someone who primarily works with non-Java clients, I couldn't +1 this > enough. > > If there's truly a benefit (and probably there is...the Facebook post > announcing ZStandard had some fairly compelling benchmarks) then yes, let's > do this. But otherwise, please no. > > On Tue, Jan 31, 2017 at 1:47 AM, Magnus Edenhill <mag...@edenhill.se> > wrote: > > > Hi Dongjin and good work on the KIP, > > > > I understand that ZStandard is generally considered an improvement over > > LZ4, but the > > benchmark you provided on the KIP-110 wiki doesn't really reflect that, > and > > even > > makes a note that they are comparable: > > *> As you can see above, ZStandard shows outstanding performance in both > of > > compression rate and speed, especially working with the speed-first > setting > > (level 1). To the extent that only LZ4 can be compared to ZStandard.* > > > > My concern is that this minor improvement (based on the benchmark) over > LZ4 > > does not warrant the work > > of adding support for a new compression codec to the broker, official > > clients and horde of 3rd party clients, including > > upgrade paths, transformations, tests, additional dependencies, etc. > > > > Is it possible to produce more convincing comparisons? > > > > Thanks, > > Magnus > > > > > > > > > > > > 2017-01-31 10:28 GMT+01:00 Dongjin Lee <dong...@apache.org>: > > > > > Ismael & All, > > > > > > After Inspecting the related code & commits, I concluded following: > > > > > > 1. We have to update the masking value which is used to retrieve the > used > > > codec id from the messages, to enable the retrieval of the 3rd bit of > > > compression type field of the message. > > > 2. The above task is already done; so, we need nothing. > > > > > > Here is why. > > > > > > Let's start with the first one, with the scenario Juma proposed. In the > > > case of receiving the message of unsupported compression type, the > > receiver > > > (= broker or consumer) raises IllegalArgumentException[^1][^2]. The > key > > > element in this operation is Record#COMPRESSION_CODEC_MASK, which is > used > > > to extract the codec id. We have to update this value from 2-bits > > extractor > > > (=0x03) to 3-bits extractor (=0x07). > > > > > > But in fact, this task is already done, so its current value is 0x07. > We > > > don't have to update it. > > > > > > The reason why this task is already done has some story; From the first > > > time Record.java file was added to the project[^3], the > > > COMPRESSION_CODEC_MASK was already 2-bits extractor, that is, 0x03. At > > that > > > time, Kafka supported only two compression types - GZipCompression and > > > SnappyCompression.[^4] After that, KAFKA-1456 introduced two additional > > > codecs of LZ4 and LZ4C[^5]. This update modified COMPRESSION_CODEC_MASK > > > into 3 bits extractor, 0x07, in the aim of supporting four compression > > > codecs. > > > > > > Although its following work, KAFKA-1493, removed the support of LZ4C > > > codec[^6], this mask was not reverted into 2-bits extractor - by this > > > reason, we don't need to care about the message format. > > > > > > Attached screenshot is my test on Juma's scenario. I created topic & > sent > > > some messages using the snapshot version with ZStandard compression and > > > received the message with the latest version. As you can see, it works > > > perfectly as expected. > > > > > > If you have more opinion to this issue, don't hesitate to send me as a > > > message. > > > > > > Best, > > > Dongjin > > > > > > [^1]: see: Record#compressionType. > > > [^2]: There is similar routine in Message.scala. But after KAFKA-4390, > > > that routine is not being used anymore - more precisely, Message class > is > > > now used in ConsoleConsumer only. I think this class should be replaced > > but > > > since it is a separated topic, I will send another message for this > > issue. > > > [^3]: commit 642da2f (2011.8.2). > > > [^4]: commit c51b940. > > > [^5]: commit 547cced. > > > [^6]: commit 37356bf. > > > > > > On Thu, Jan 26, 2017 at 12:35 AM, Ismael Juma <ism...@juma.me.uk> > wrote: > > > > > >> So far the discussion was around the performance characteristics of > the > > >> new > > >> compression algorithm. Another area that is important and is not > covered > > >> in > > >> the KIP is the compatibility implications. For example, what happens > if > > a > > >> consumer that doesn't support zstd tries to consume a topic compressed > > >> with > > >> it? Or if a broker that doesn't support receives data compressed with > > it? > > >> If we go through that exercise, then more changes may be required > (like > > >> bumping the version of produce/fetch protocols). > > >> > > >> Ismael > > >> > > >> On Wed, Jan 25, 2017 at 3:22 PM, Ben Stopford <b...@confluent.io> > wrote: > > >> > > >> > Is there more discussion to be had on this KIP, or should it be > taken > > >> to a > > >> > vote? > > >> > > > >> > On Mon, Jan 16, 2017 at 6:37 AM Dongjin Lee <dong...@apache.org> > > wrote: > > >> > > > >> > > I updated KIP-110 with JMH-measured benchmark results. Please > have a > > >> > review > > >> > > when you are free. (The overall result is not different yet.) > > >> > > > > >> > > Regards, > > >> > > Dongjin > > >> > > > > >> > > +1. Could anyone assign KAFKA-4514 to me? > > >> > > > > >> > > On Thu, Jan 12, 2017 at 11:39 AM, Dongjin Lee <dong...@apache.org > > > > >> > wrote: > > >> > > > > >> > > > Okay, I will have a try. > > >> > > > Thanks Ewen for the guidance!! > > >> > > > > > >> > > > Best, > > >> > > > Dongjin > > >> > > > > > >> > > > On Thu, Jan 12, 2017 at 6:44 AM, Ismael Juma <ism...@juma.me.uk > > > > >> > wrote: > > >> > > > > > >> > > >> That's a good point Ewen. Dongjin, you could use the branch > that > > >> Ewen > > >> > > >> linked for the performance testing. It would also help validate > > the > > >> > PR. > > >> > > >> > > >> > > >> Ismael > > >> > > >> > > >> > > >> On Wed, Jan 11, 2017 at 9:38 PM, Ewen Cheslack-Postava < > > >> > > e...@confluent.io > > >> > > >> > > > >> > > >> wrote: > > >> > > >> > > >> > > >> > FYI, there's an outstanding patch for getting some JMH > > >> benchmarking > > >> > > >> setup: > > >> > > >> > https://github.com/apache/kafka/pull/1712 I haven't found > time > > >> to > > >> > > >> review > > >> > > >> > it > > >> > > >> > (and don't really know JMH well anyway) but it might be worth > > >> > getting > > >> > > >> that > > >> > > >> > landed so we can use it for this as well. > > >> > > >> > > > >> > > >> > -Ewen > > >> > > >> > > > >> > > >> > On Wed, Jan 11, 2017 at 6:35 AM, Dongjin Lee < > > dong...@apache.org > > >> > > > >> > > >> wrote: > > >> > > >> > > > >> > > >> > > Hi Ismael, > > >> > > >> > > > > >> > > >> > > 1. In the case of compression output, yes, lz4 is producing > > the > > >> > > >> smaller > > >> > > >> > > output than gzip. In fact, my benchmark was inspired > > >> > > >> > > by MessageCompressionTest#testCompressSize unit test and > the > > >> > result > > >> > > >> is > > >> > > >> > > same - 396 bytes for gzip and 387 bytes for lz4. > > >> > > >> > > 2. I agree that my (former) approach can result in > unreliable > > >> > > output. > > >> > > >> > > However, I am experiencing difficulties on how to acquire > the > > >> > > >> benchmark > > >> > > >> > > metrics from Kafka. For you recommended JMH, I just started > > to > > >> > > google > > >> > > >> for > > >> > > >> > > it. If possible, could you give any example on how to use > JMH > > >> > > against > > >> > > >> > > Kafka? If it is the case, it will be a great help. > > >> > > >> > > Regards,Dongjin > > >> > > >> > > > > >> > > >> > > _____________________________ > > >> > > >> > > From: Ismael Juma <ism...@juma.me.uk> > > >> > > >> > > Sent: Wednesday, January 11, 2017 7:33 PM > > >> > > >> > > Subject: Re: [DISCUSS] KIP-110: Add Codec for ZStandard > > >> > Compression > > >> > > >> > > To: <dev@kafka.apache.org> > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > Thanks Dongjin. I highly recommend using JMH for the > > benchmark, > > >> > the > > >> > > >> > > existing one has a few problems that could result in > > unreliable > > >> > > >> results. > > >> > > >> > > Also, it's a bit surprising that LZ4 is producing smaller > > >> output > > >> > > than > > >> > > >> > gzip. > > >> > > >> > > Is that right? > > >> > > >> > > > > >> > > >> > > Ismael > > >> > > >> > > > > >> > > >> > > On Wed, Jan 11, 2017 at 10:20 AM, Dongjin Lee < > > >> dong...@apache.org > > >> > > > > >> > > >> > wrote: > > >> > > >> > > > > >> > > >> > > > Ismael, > > >> > > >> > > > > > >> > > >> > > > I pushed the benchmark code I used, with some updates > > >> > (iteration: > > >> > > >> 20 -> > > >> > > >> > > > 1000). I also updated the KIP page with the updated > > benchmark > > >> > > >> results. > > >> > > >> > > > Please take a review when you are free. The attached > > >> screenshot > > >> > > >> shows > > >> > > >> > how > > >> > > >> > > > to run the benchmarker. > > >> > > >> > > > > > >> > > >> > > > Thanks, > > >> > > >> > > > Dongjin > > >> > > >> > > > > > >> > > >> > > > On Tue, Jan 10, 2017 at 8:03 PM, Dongjin Lee < > > >> > dong...@apache.org> > > >> > > >> > wrote: > > >> > > >> > > > > > >> > > >> > > >> Ismael, > > >> > > >> > > >> > > >> > > >> > > >> I see. Then, I will share the benchmark code I used by > > >> > tomorrow. > > >> > > >> > Thanks > > >> > > >> > > >> for your guidance. > > >> > > >> > > >> > > >> > > >> > > >> Best, > > >> > > >> > > >> Dongjin > > >> > > >> > > >> > > >> > > >> > > >> ----- > > >> > > >> > > >> > > >> > > >> > > >> Dongjin Lee > > >> > > >> > > >> > > >> > > >> > > >> Software developer in Line+. > > >> > > >> > > >> So interested in massive-scale machine learning. > > >> > > >> > > >> > > >> > > >> > > >> facebook: www.facebook.com/dongjin.lee.kr > > >> > > >> > > >> linkedin: kr.linkedin.com/in/dongjinleekr > > >> > > >> > > >> github: github.com/dongjinleekr > > >> > > >> > > >> twitter: www.twitter.com/dongjinleekr > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> On Tue, Jan 10, 2017 at 7:24 PM +0900, "Ismael Juma" < > > >> > > >> > ism...@juma.me.uk > > >> > > >> > > > > > >> > > >> > > >> wrote: > > >> > > >> > > >> > > >> > > >> > > >> Dongjin, > > >> > > >> > > >>> > > >> > > >> > > >>> The KIP states: > > >> > > >> > > >>> > > >> > > >> > > >>> "I compared the compressed size and compression time > of 3 > > >> > > >> 1kb-sized > > >> > > >> > > >>> messages (3102 bytes in total), with the > > >> Draft-implementation > > >> > of > > >> > > >> > > ZStandard > > >> > > >> > > >>> Compression Codec and all currently available > > >> > CompressionCodecs. > > >> > > >> All > > >> > > >> > > >>> elapsed times are the average of 20 trials." > > >> > > >> > > >>> > > >> > > >> > > >>> But doesn't give any details of how this was > implemented. > > >> Is > > >> > the > > >> > > >> > source > > >> > > >> > > >>> code available somewhere? Micro-benchmarking in the JVM > > is > > >> > > pretty > > >> > > >> > > tricky so > > >> > > >> > > >>> it needs verification before numbers can be trusted. A > > >> > > performance > > >> > > >> > test > > >> > > >> > > >>> with kafka-producer-perf-test.sh would be nice to have > as > > >> > well, > > >> > > if > > >> > > >> > > possible. > > >> > > >> > > >>> > > >> > > >> > > >>> Thanks, > > >> > > >> > > >>> Ismael > > >> > > >> > > >>> > > >> > > >> > > >>> On Tue, Jan 10, 2017 at 7:44 AM, Dongjin Lee wrote: > > >> > > >> > > >>> > > >> > > >> > > >>> > Ismael, > > >> > > >> > > >>> > > > >> > > >> > > >>> > 1. Is the benchmark in the KIP page not enough? You > > mean > > >> we > > >> > > >> need a > > >> > > >> > > whole > > >> > > >> > > >>> > performance test using kafka-producer-perf-test.sh? > > >> > > >> > > >>> > > > >> > > >> > > >>> > 2. It seems like no major project is relying on it > > >> > currently. > > >> > > >> > > However, > > >> > > >> > > >>> > after reviewing the code, I concluded that at least > > this > > >> > > project > > >> > > >> > has > > >> > > >> > > a good > > >> > > >> > > >>> > test coverage. And for the problem of upstream > > tracking - > > >> > > >> although > > >> > > >> > > there is > > >> > > >> > > >>> > no significant update on ZStandard to judge this > > >> problem, it > > >> > > >> seems > > >> > > >> > > not bad. > > >> > > >> > > >>> > If required, I can take responsibility of the > tracking > > >> for > > >> > > this > > >> > > >> > > library. > > >> > > >> > > >>> > > > >> > > >> > > >>> > Thanks, > > >> > > >> > > >>> > Dongjin > > >> > > >> > > >>> > > > >> > > >> > > >>> > On Tue, Jan 10, 2017 at 7:09 AM, Ismael Juma wrote: > > >> > > >> > > >>> > > > >> > > >> > > >>> > > Thanks for posting the KIP, ZStandard looks like a > > nice > > >> > > >> > > improvement over > > >> > > >> > > >>> > > the existing compression algorithms. A couple of > > >> > questions: > > >> > > >> > > >>> > > > > >> > > >> > > >>> > > 1. Can you please elaborate on the details of the > > >> > benchmark? > > >> > > >> > > >>> > > 2. About https://github.com/luben/zstd-jni, can we > > >> rely > > >> > on > > >> > > >> it? A > > >> > > >> > > few > > >> > > >> > > >>> > > things > > >> > > >> > > >>> > > to consider: are there other projects using it, > does > > it > > >> > have > > >> > > >> good > > >> > > >> > > test > > >> > > >> > > >>> > > coverage, are there performance tests, does it > track > > >> > > upstream > > >> > > >> > > closely? > > >> > > >> > > >>> > > > > >> > > >> > > >>> > > Thanks, > > >> > > >> > > >>> > > Ismael > > >> > > >> > > >>> > > > > >> > > >> > > >>> > > On Fri, Jan 6, 2017 at 2:40 AM, Dongjin Lee wrote: > > >> > > >> > > >>> > > > > >> > > >> > > >>> > > > Hi all, > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > I've just posted a new KIP "KIP-110: Add Codec > for > > >> > > ZStandard > > >> > > >> > > >>> > Compression" > > >> > > >> > > >>> > > > for > > >> > > >> > > >>> > > > discussion: > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > https://cwiki.apache.org/confl > > >> uence/display/KAFKA/KIP- > > >> > > >> > > >>> > > > 110%3A+Add+Codec+for+ZStandard+Compression > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > Please have a look when you are free. > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > Best, > > >> > > >> > > >>> > > > Dongjin > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > -- > > >> > > >> > > >>> > > > *Dongjin Lee* > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > *Software developer in Line+.So interested in > > >> > > massive-scale > > >> > > >> > > machine > > >> > > >> > > >>> > > > learning.facebook: www.facebook.com/dongjin.lee. > kr > > >> > > >> > > >>> > > > linkedin: > > >> > > >> > > >>> > > > kr.linkedin.com/in/dongjinleekr > > >> > > >> > > >>> > > > github: > > >> > > >> > > >>> > > > github.com/dongjinleekr > > >> > > >> > > >>> > > > twitter: www.twitter.com/dongjinleekr > > >> > > >> > > >>> > > > * > > >> > > >> > > >>> > > > > > >> > > >> > > >>> > > > > >> > > >> > > >>> > > > >> > > >> > > >>> > > > >> > > >> > > >>> > > > >> > > >> > > >>> > -- > > >> > > >> > > >>> > *Dongjin Lee* > > >> > > >> > > >>> > > > >> > > >> > > >>> > > > >> > > >> > > >>> > *Software developer in Line+.So interested in > > >> massive-scale > > >> > > >> machine > > >> > > >> > > >>> > learning.facebook: www.facebook.com/dongjin.lee.kr > > >> > > >> > > >>> > linkedin: > > >> > > >> > > >>> > kr.linkedin.com/in/dongjinleekr > > >> > > >> > > >>> > github: > > >> > > >> > > >>> > github.com/dongjinleekr > > >> > > >> > > >>> > twitter: www.twitter.com/dongjinleekr > > >> > > >> > > >>> > * > > >> > > >> > > >>> > > > >> > > >> > > >>> > > >> > > >> > > >>> > > >> > > >> > > > > > >> > > >> > > > > > >> > > >> > > > -- > > >> > > >> > > > *Dongjin Lee* > > >> > > >> > > > > > >> > > >> > > > > > >> > > >> > > > *Software developer in Line+.So interested in > massive-scale > > >> > > machine > > >> > > >> > > > learning.facebook: www.facebook.com/dongjin.lee.kr > > >> > > >> > > > <http://www.facebook.com/dongjin.lee.kr>linkedin: > > >> > > >> kr.linkedin.com/in/ > > >> > > >> > > dongjinleekr > > >> > > >> > > > <http://kr.linkedin.com/in/dongjinleekr>github: > > >> > > >> > > > <http://goog_969573159/>github.com/dongjinleekr > > >> > > >> > > > <http://github.com/dongjinleekr>twitter: > > >> > > >> www.twitter.com/dongjinleekr > > >> > > >> > > > <http://www.twitter.com/dongjinleekr>* > > >> > > >> > > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> > > > >> > > >> > > >> > > > > > >> > > > > > >> > > > > > >> > > > -- > > >> > > > *Dongjin Lee* > > >> > > > > > >> > > > > > >> > > > *Software developer in Line+.So interested in massive-scale > > machine > > >> > > > learning.facebook: www.facebook.com/dongjin.lee.kr > > >> > > > <http://www.facebook.com/dongjin.lee.kr>linkedin: > > >> > > kr.linkedin.com/in/dongjinleekr > > >> > > > <http://kr.linkedin.com/in/dongjinleekr>github: > > >> > > > <http://goog_969573159/>github.com/dongjinleekr > > >> > > > <http://github.com/dongjinleekr>twitter: > > >> www.twitter.com/dongjinleekr > > >> > > > <http://www.twitter.com/dongjinleekr>* > > >> > > > > > >> > > > > >> > > > > >> > > > > >> > > -- > > >> > > *Dongjin Lee* > > >> > > > > >> > > > > >> > > *Software developer in Line+.So interested in massive-scale > machine > > >> > > learning.facebook: www.facebook.com/dongjin.lee.kr > > >> > > <http://www.facebook.com/dongjin.lee.kr>linkedin: > > >> > > kr.linkedin.com/in/dongjinleekr > > >> > > <http://kr.linkedin.com/in/dongjinleekr>github: > > >> > > <http://goog_969573159/>github.com/dongjinleekr > > >> > > <http://github.com/dongjinleekr>twitter: > > www.twitter.com/dongjinleekr > > >> > > <http://www.twitter.com/dongjinleekr>* > > >> > > > > >> > > > >> > > > > > > > > > > > > -- > > > *Dongjin Lee* > > > > > > > > > *Software developer in Line+.So interested in massive-scale machine > > > learning.facebook: www.facebook.com/dongjin.lee.kr > > > <http://www.facebook.com/dongjin.lee.kr>linkedin: kr.linkedin.com/in/ > > dongjinleekr > > > <http://kr.linkedin.com/in/dongjinleekr>github: > > > <http://goog_969573159/>github.com/dongjinleekr > > > <http://github.com/dongjinleekr>twitter: www.twitter.com/dongjinleekr > > > <http://www.twitter.com/dongjinleekr>* > > > > > > -- *Dongjin Lee* *A hitchhiker in the mathematical world.* *github: <http://goog_969573159/>github.com/dongjinleekr <http://github.com/dongjinleekr>linkedin: kr.linkedin.com/in/dongjinleekr <http://kr.linkedin.com/in/dongjinleekr>slideshare: www.slideshare.net/dongjinleekr <http://www.slideshare.net/dongjinleekr>*