Re: [VOTE] 0.10.0.0 RC4

Becket Qin Fri, 13 May 2016 15:20:07 -0700

Gwen,

The version we are currently running in production is trunk as of Feb 24,
which includes KAFKA-3025.


Our release test cluster has been running this version for about two
months, and I haven't seen throughput issues so far. But we are probably not
running at the max capacity of the brokers. I will set up some throughput
tests and see if I can reproduce this issue.

Thanks,

Jiangjie (Becket) Qin


On Fri, May 13, 2016 at 11:41 AM, Gwen Shapira <[email protected]> wrote:

> Becket,
>
> Did you try deploying one of the 0.10.0 candidates at LinkedIn? Did
> you see this issue?
>
> Gwen
>
> On Fri, May 13, 2016 at 10:30 AM, Becket Qin <[email protected]> wrote:
> > Tom,
> >
> > Maybe this was mentioned and I missed it. I am wondering whether you see
> > performance degradation on the consumer side when TLS is used. This could
> > help us understand whether the issue is only producer related or applies to
> > TLS in general.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <[email protected]> wrote:
> >
> >> Ismael,
> >>
> >> Thanks. I'm writing up an issue with some new findings since yesterday
> >> right now.
> >>
> >> Thanks
> >>
> >> Tom
> >>
> >> On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <[email protected]> wrote:
> >>
> >> > Hi Tom,
> >> >
> >> > That's because JIRA is in lockdown due to excessive spam. I have added
> >> > you as a contributor in JIRA and you should be able to file a ticket now.
> >> >
> >> > Thanks,
> >> > Ismael
> >> >
> >> > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <[email protected]>
> >> > wrote:
> >> >
> >> > > Ok, I don't seem to be able to file a new Jira issue at all. Can
> >> > > somebody check my permissions on Jira? My user is `tcrayford-heroku`
> >> > >
> >> > > Tom Crayford
> >> > > Heroku Kafka
> >> > >
> >> > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <[email protected]> wrote:
> >> > >
> >> > > > Tom,
> >> > > >
> >> > > > We don't have a CSV metrics reporter in the producer right now. The
> >> > > > metrics are available in JMX. You can find the details at
> >> > > > http://kafka.apache.org/documentation.html#new_producer_monitoring
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jun
> >> > > >
> >> > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <[email protected]>
> >> > > > wrote:
> >> > > >
> >> > > > > Yep, I can try those particular commits tomorrow. Before I try a
> >> > > > > bisect, I'm going to replicate this with a smaller-scale perf test
> >> > > > > that is less intensive to iterate on.
> >> > > > >
> >> > > > > Jun, inline:
> >> > > > >
> >> > > > > On Thursday, 12 May 2016, Jun Rao <[email protected]> wrote:
> >> > > > >
> >> > > > > > Tom,
> >> > > > > >
> >> > > > > > Thanks for reporting this. A few quick comments.
> >> > > > > >
> >> > > > > > 1. Did you send the right command for producer-perf? The command
> >> > > > > > limits the throughput to 100 msgs/sec, so I'm not sure how a single
> >> > > > > > producer can get 75K msgs/sec.
> >> > > > >
> >> > > > >
> >> > > > > Ah yep, wrong command. I'll get the right one tomorrow. Sorry, I was
> >> > > > > interpolating variables into a shell script.
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > 2. Could you collect some stats (e.g. average batch size) in the
> >> > > > > > producer and see if there is any noticeable difference between 0.9
> >> > > > > > and 0.10?
> >> > > > >
> >> > > > >
> >> > > > > That'd just be hooking up the CSV metrics reporter right?
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > 3. Is the broker-to-broker communication also on SSL? Could you do
> >> > > > > > another test with replication factor 1 and see if you still see the
> >> > > > > > degradation?
> >> > > > >
> >> > > > >
> >> > > > > Interbroker replication is always SSL in all test runs so far. I can
> >> > > > > try with replication factor 1 tomorrow.
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > Finally, email is probably not the best way to discuss performance
> >> > > > > > results. If you have more of them, could you create a jira and
> >> > > > > > attach your findings there?
> >> > > > >
> >> > > > >
> >> > > > > Yep. I only wrote the email because JIRA was in lockdown mode and I
> >> > > > > couldn't create new issues.
> >> > > > >
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > >
> >> > > > > > Jun
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <[email protected]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > We've started running our usual suite of performance tests against
> >> > > > > > > Kafka 0.10.0.0 RC. These tests orchestrate multiple consumer/producer
> >> > > > > > > machines to run a fairly normal mixed workload of producers and
> >> > > > > > > consumers (each producer/consumer is just an instance of Kafka's
> >> > > > > > > inbuilt consumer/producer perf tests). We've found about a 33%
> >> > > > > > > performance drop in the producer if TLS is used (compared to 0.9.0.1).
> >> > > > > > >
> >> > > > > > > We've seen notable producer performance degradations between 0.9.0.1
> >> > > > > > > and 0.10.0.0 RC. We're running as of the commit 9404680 right now.
> >> > > > > > >
> >> > > > > > > Our specific test case runs Kafka on 8 EC2 machines, with enhanced
> >> > > > > > > networking. Nothing is changed between the instances, and I've
> >> > > > > > > reproduced this over 4 different sets of clusters now. We're seeing
> >> > > > > > > about a 33% performance drop between 0.9.0.1 and 0.10.0.0 as of commit
> >> > > > > > > 9404680. Please note that this doesn't match up with
> >> > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565, because our
> >> > > > > > > performance tests are with compression off, and this seems to be a
> >> > > > > > > TLS-only issue.
> >> > > > > > >
> >> > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with replication factor of
> >> > > > > > > 3 and 13 producers max out at around 1 million 100 byte messages a
> >> > > > > > > second. Under 0.9.0.1, the same cluster does 1.5 million messages a
> >> > > > > > > second. Both tests were with TLS on. I've reproduced this on multiple
> >> > > > > > > clusters now (5 or so of each version) to account for the inherent
> >> > > > > > > performance variance of EC2. There's no notable performance difference
> >> > > > > > > without TLS on these runs - it appears to be a TLS regression
> >> > > > > > > entirely.
> >> > > > > > >
> >> > > > > > > A single producer with TLS under 0.10 does about 75k messages/s. Under
> >> > > > > > > 0.9.0.1 it does around 120k messages/s.
> >> > > > > > >
> >> > > > > > > The exact producer-perf line we're using is this:
> >> > > > > > >
> >> > > > > > > bin/kafka-producer-perf-test --topic "bench" --num-records "500000000"
> >> > > > > > > --record-size "100" --throughput "100" --producer-props acks="-1"
> >> > > > > > > bootstrap.servers=REDACTED ssl.keystore.location=client.jks
> >> > > > > > > ssl.keystore.password=REDACTED ssl.truststore.location=server.jks
> >> > > > > > > ssl.truststore.password=REDACTED
> >> > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
> >> > > > > > >
> >> > > > > > > We're using the same setup, machine type etc for each test run.
> >> > > > > > >
> >> > > > > > > We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and
> >> > > > > > > the TLS performance impact was there for both.
> >> > > > > > >
> >> > > > > > > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't
> >> > > > > > > seen anything that seemed to have this kind of impact - indeed the TLS
> >> > > > > > > code doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.
> >> > > > > > >
> >> > > > > > > Any thoughts? Should I file an issue and see about reproducing a more
> >> > > > > > > minimal test case?
> >> > > > > > >
> >> > > > > > > I don't think this is related to
> >> > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565 - that is for
> >> > > > > > > compression on and plaintext, and this is for TLS only.
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
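
For reference, the throughput figures quoted in the thread can be turned into
relative drops with a short calculation. This is only a sketch: the helper
name relative_drop is made up for illustration, and the inputs are the
approximate numbers Tom reported (messages per second, TLS enabled), not new
measurements.

```python
def relative_drop(baseline: float, candidate: float) -> float:
    """Return the fractional throughput drop going from baseline to candidate."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


# Approximate figures quoted in the thread (messages/s, TLS on).
cluster_drop = relative_drop(1_500_000, 1_000_000)  # 0.9.0.1 vs 0.10.0-rc4, 8 nodes
single_drop = relative_drop(120_000, 75_000)        # 0.9.0.1 vs 0.10, one producer

print(f"cluster: {cluster_drop:.1%}")
print(f"single producer: {single_drop:.1%}")
```

The cluster-wide numbers give the roughly 33% drop mentioned in the original
report; the single-producer numbers work out somewhat higher.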
