Gwen,

The version we are currently running in production is trunk as of Feb 24, which includes KAFKA-3025. Our release test cluster has been running this version for about two months, and I haven't seen throughput issues so far. But we are probably not running at the max capacity of the brokers. I will set up some throughput tests and see if I can reproduce this issue.

Thanks,

Jiangjie (Becket) Qin

On Fri, May 13, 2016 at 11:41 AM, Gwen Shapira <g...@confluent.io> wrote:

> Becket,
>
> Did you try deploying one of the 0.10.0 candidates at LinkedIn? Did
> you see this issue?
>
> Gwen
>
> On Fri, May 13, 2016 at 10:30 AM, Becket Qin <becket....@gmail.com> wrote:
> > Tom,
> >
> > Maybe it is mentioned and I missed it. I am wondering if you see performance
> > degradation on the consumer side when TLS is used? This could help us
> > understand whether the issue is only producer related or TLS in general.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tcrayf...@heroku.com> wrote:
> >
> >> Ismael,
> >>
> >> Thanks. I'm writing up an issue with some new findings since yesterday
> >> right now.
> >>
> >> Thanks
> >>
> >> Tom
> >>
> >> On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <ism...@juma.me.uk> wrote:
> >>
> >> > Hi Tom,
> >> >
> >> > That's because JIRA is in lockdown due to excessive spam. I have added you
> >> > as a contributor in JIRA and you should be able to file a ticket now.
> >> >
> >> > Thanks,
> >> > Ismael
> >> >
> >> > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <tcrayf...@heroku.com> wrote:
> >> >
> >> > > Ok, I don't seem to be able to file a new Jira issue at all. Can somebody
> >> > > check my permissions on Jira? My user is `tcrayford-heroku`
> >> > >
> >> > > Tom Crayford
> >> > > Heroku Kafka
> >> > >
> >> > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <j...@confluent.io> wrote:
> >> > >
> >> > > > Tom,
> >> > > >
> >> > > > We don't have a CSV metrics reporter in the producer right now. The
> >> > > > metrics will be available in jmx.
> >> > > > You can find out the details in
> >> > > > http://kafka.apache.org/documentation.html#new_producer_monitoring
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jun
> >> > > >
> >> > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <tcrayf...@heroku.com> wrote:
> >> > > >
> >> > > > > Yep, I can try those particular commits tomorrow. Before I try a bisect,
> >> > > > > I'm going to replicate with a smaller-scale perf test that is less
> >> > > > > intensive to iterate on.
> >> > > > >
> >> > > > > Jun, inline:
> >> > > > >
> >> > > > > On Thursday, 12 May 2016, Jun Rao <j...@confluent.io> wrote:
> >> > > > >
> >> > > > > > Tom,
> >> > > > > >
> >> > > > > > Thanks for reporting this. A few quick comments.
> >> > > > > >
> >> > > > > > 1. Did you send the right command for producer-perf? The command limits the
> >> > > > > > throughput to 100 msgs/sec. So, not sure how a single producer can get 75K
> >> > > > > > msgs/sec.
> >> > > > >
> >> > > > > Ah yep, wrong commands. I'll get the right one tomorrow. Sorry, was
> >> > > > > interpolating variables into a shell script.
> >> > > > >
> >> > > > > > 2. Could you collect some stats (e.g. average batch size) in the producer
> >> > > > > > and see if there is any noticeable difference between 0.9 and 0.10?
> >> > > > >
> >> > > > > That'd just be hooking up the CSV metrics reporter, right?
> >> > > > >
> >> > > > > > 3. Is the broker-to-broker communication also on SSL? Could you do another
> >> > > > > > test with replication factor 1 and see if you still see the degradation?
> >> > > > >
> >> > > > > Interbroker replication is always SSL in all test runs so far. I can try
> >> > > > > with replication factor 1 tomorrow.
> >> > > > >
> >> > > > > > Finally, email is probably not the best way to discuss performance results.
> >> > > > > > If you have more of them, could you create a jira and attach your findings
> >> > > > > > there?
> >> > > > >
> >> > > > > Yep. I only wrote the email because JIRA was in lockdown mode and I
> >> > > > > couldn't create new issues.
> >> > > > >
> >> > > > > > Thanks,
> >> > > > > >
> >> > > > > > Jun
> >> > > > > >
> >> > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tcrayf...@heroku.com> wrote:
> >> > > > > >
> >> > > > > > > We've started running our usual suite of performance tests against Kafka
> >> > > > > > > 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to
> >> > > > > > > run a fairly normal mixed workload of producers and consumers (each
> >> > > > > > > producer/consumer is just an instance of Kafka's inbuilt consumer/producer
> >> > > > > > > perf tests). We've found about a 33% performance drop in the producer if
> >> > > > > > > TLS is used (compared to 0.9.0.1).
> >> > > > > > >
> >> > > > > > > We've seen notable producer performance degradations between 0.9.0.1 and
> >> > > > > > > 0.10.0.0 RC. We're running as of the commit 9404680 right now.
> >> > > > > > >
> >> > > > > > > Our specific test case runs Kafka on 8 EC2 machines, with enhanced
> >> > > > > > > networking. Nothing is changed between the instances, and I've reproduced
> >> > > > > > > this over 4 different sets of clusters now. We're seeing about a 33%
> >> > > > > > > performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680.
> >> > > > > > > Please note that this doesn't match up with
> >> > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565, because our performance
> >> > > > > > > tests are with compression off, and this seems to be a TLS-only issue.
> >> > > > > > >
> >> > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with replication factor of 3,
> >> > > > > > > and 13 producers max out at around 1 million 100 byte messages a second.
> >> > > > > > > Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both
> >> > > > > > > tests were with TLS on. I've reproduced this on multiple clusters now (5 or
> >> > > > > > > so of each version) to account for the inherent performance variance of
> >> > > > > > > EC2. There's no notable performance difference without TLS on these runs -
> >> > > > > > > it appears to be a TLS regression entirely.
> >> > > > > > >
> >> > > > > > > A single producer with TLS under 0.10 does about 75k messages/s. Under
> >> > > > > > > 0.9.0.1 it does around 120k messages/s.
> >> > > > > > >
> >> > > > > > > The exact producer-perf line we're using is this:
> >> > > > > > >
> >> > > > > > > bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
> >> > > > > > > --record-size "100" --throughput "100" --producer-props acks="-1"
> >> > > > > > > bootstrap.servers=REDACTED ssl.keystore.location=client.jks
> >> > > > > > > ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
> >> > > > > > > ssl.truststore.password=REDACTED
> >> > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
> >> > > > > > >
> >> > > > > > > We're using the same setup, machine type etc for each test run.
> >> > > > > > >
> >> > > > > > > We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the TLS
> >> > > > > > > performance impact was there for both.
> >> > > > > > >
> >> > > > > > > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen
> >> > > > > > > anything that seemed to have this kind of impact - indeed the TLS code
> >> > > > > > > doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.
> >> > > > > > >
> >> > > > > > > Any thoughts? Should I file an issue and see about reproducing a more
> >> > > > > > > minimal test case?
> >> > > > > > >
> >> > > > > > > I don't think this is related to
> >> > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565 - that is for compression
> >> > > > > > > on and plaintext, and this is for TLS only.
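
As a rough check on the throughput figures quoted in the original report, the relative drops can be worked out directly. This is only an illustrative sketch: the helper name `relative_drop` is hypothetical, and the inputs are the approximate numbers stated in the thread (TLS on), not new measurements.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional throughput drop going from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return (before - after) / before


# Approximate figures from the thread, in messages per second (assumed exact
# here only for the sake of the arithmetic).
cluster_drop = relative_drop(1_500_000, 1_000_000)  # 0.9.0.1 vs 0.10.0-rc4, 13 producers
single_drop = relative_drop(120_000, 75_000)        # single producer, 0.9.0.1 vs 0.10

print(f"cluster: {cluster_drop:.1%}")
print(f"single producer: {single_drop:.1%}")
```

The cluster-wide figures reproduce the roughly 33% drop mentioned in the report, while the single-producer figures work out somewhat higher, at 37.5%.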