https://github.com/apache/kafka/pull/1389

On Sun, May 15, 2016 at 9:22 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Tom,
>
> Great to hear that the failure testing scenario went well. :)
>
> Your suggested improvement sounds good to me and a PR would be great. For
> this kind of change, you can skip the JIRA, just prefix the PR title with
> `MINOR:`.
>
> Thanks,
> Ismael
>
> On Sun, May 15, 2016 at 9:17 PM, Tom Crayford <tcrayf...@heroku.com>
> wrote:
>
> > How about this?
> >
> >     <b>Note:</b> Due to the additional timestamp introduced in each
> >     message (8 bytes of data), producers sending small messages may see a
> >     message throughput degradation because of the increased overhead.
> >     Likewise, replication now transmits an additional 8 bytes per message.
> >     If you're running close to the network capacity of your cluster, it's
> >     possible that you'll overwhelm the network cards and see failures and
> >     performance issues due to the overload.
> >     When receiving compressed messages, 0.10.0 brokers avoid recompressing
> >     the messages, which in general reduces the latency and improves the
> >     throughput. In certain cases, this may reduce the batching size on the
> >     producer, which could lead to worse throughput. If this happens, users
> >     can tune linger.ms and batch.size of the producer for better
> >     throughput.
> >
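> > For instance, the docs could show something like this right after that
> > note (illustrative values only - the right numbers depend on the
> > workload):
> >
> >     # Hypothetical example: trade a little latency for larger batches by
> >     # raising linger.ms and batch.size on the producer.
> >     bin/kafka-producer-perf-test --topic bench --num-records 50000000 \
> >       --record-size 100 --throughput -1 \
> >       --producer-props bootstrap.servers=REDACTED acks=-1 \
> >       linger.ms=10 batch.size=65536
> >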
> > Would you like a Jira/PR with this kind of change so we can discuss it in
> > a more convenient format?
> >
> > Re our failure testing scenario: Kafka 0.10 RC behaves exactly the same
> > under failure as 0.9 - the controller typically shifts the leader in
> > around 2 seconds, and the benchmark sees a small drop in throughput during
> > that, then another drop whilst the replacement broker comes back up to
> > speed. So, overall we're extremely happy and excited for this release!
> > Thanks to the committers and maintainers for all their hard work.
> >
> > On Sun, May 15, 2016 at 9:03 PM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Hi Tom,
> > >
> > > Thanks for the update and for all the testing you have done! No worries
> > > about the chase here, I'd much rather have false positives by people who
> > > are validating the releases than false negatives because people don't
> > > validate the releases. :)
> > >
> > > The upgrade note we currently have follows:
> > >
> > > https://github.com/apache/kafka/blob/0.10.0/docs/upgrade.html#L67
> > >
> > > Please feel free to suggest improvements.
> > >
> > > Thanks,
> > > Ismael
> > >
> > > On Sun, May 15, 2016 at 6:39 PM, Tom Crayford <tcrayf...@heroku.com>
> > > wrote:
> > >
> > > > I've been digging into this some more. It seems like this may have
> > > > been an issue with the benchmarks maxing out the network card - under
> > > > 0.10.0.0-RC the slight additional bandwidth per message seems to have
> > > > pushed the brokers' NICs into overload territory where they start
> > > > dropping packets (verified with ifconfig on each broker). This leads to
> > > > brokers not being able to talk to ZooKeeper properly, which leads to
> > > > OfflinePartitions, which then causes issues with the benchmark's
> > > > validity, as throughput drops a lot when brokers are flapping in and
> > > > out of being online. Under 0.9.0.1, those 8 fewer bytes per message
> > > > mean the brokers' NICs can sustain more messages/s. There was an
> > > > "alignment" issue with the benchmarks here - under 0.9 we were *just*
> > > > at the barrier of the brokers' NICs sustaining the traffic, and under
> > > > 0.10 we pushed over that (at 1.5 million messages/s, 8 bytes extra per
> > > > message is an extra 36 MB/s with replication factor 3 [if my math is
> > > > right, and that's before SSL encryption, which may add further
> > > > overhead], which is as much as an additional producer machine).
> > > >
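> > > > To spell out that back-of-envelope calculation (assuming ~1.5 million
> > > > messages/s and 3 copies of every message with replication factor 3):
> > > >
> > > >     # 8 extra bytes per message, written to 3 brokers in total:
> > > >     echo '1500000 * 8 * 3' | bc
> > > >     # => 36000000 bytes/s, i.e. roughly 36 MB/s of extra traffic
> > > >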
> > > > The dropped packets and the flapping weren't causing notable timeout
> > > > issues in the producer, but looking at the metrics on the brokers,
> > > > offline partitions were clearly being triggered, and the broker logs
> > > > show ZK session timeouts. This is consistent with earlier benchmarking
> > > > experience - the number of producers we were running under 0.9.0.1 was
> > > > carefully selected to be just under the limit here.
> > > >
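> > > > For anyone who wants to check for the same symptoms, these are roughly
> > > > the checks I mean (assuming the NIC is eth0 and the default broker log
> > > > location - adjust for your setup):
> > > >
> > > >     # dropped packets on the broker's NIC
> > > >     ifconfig eth0 | grep -i drop
> > > >
> > > >     # ZK session expirations in the broker logs
> > > >     grep -i "expired" /path/to/kafka/logs/server.log
> > > >
> > > >     # offline partitions, via the JMX metric
> > > >     # kafka.controller:type=KafkaController,name=OfflinePartitionsCount
> > > >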
> > > > The other benchmark issue, where I reported a difference between two
> > > > single producers, was caused by a "performance of the producer machine"
> > > > issue that I wasn't properly aware of. Apologies there.
> > > >
> > > > I've done benchmarks now where I limit the producer throughput (via
> > > > --throughput) to slightly below what the NICs can sustain, and seen no
> > > > notable performance or stability difference between 0.10 and 0.9.0.1
> > > > as long as you stay under the limits of the network interfaces. All of
> > > > the clusters I have tested happily keep up a benchmark at this rate
> > > > for 6 hours under both 0.9.0.1 and 0.10.0.0. I've also verified that
> > > > our clusters are entirely network bound in these producer benchmarking
> > > > scenarios - the disks and CPU/memory have plenty of remaining capacity.
> > > >
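> > > > Concretely, that cap is just the perf test's --throughput flag, e.g.
> > > > something like the following per producer (the number here is made up
> > > > - pick one so the aggregate across all producers stays below what the
> > > > NICs can sustain):
> > > >
> > > >     # cap this producer at 100k records/s instead of running unthrottled
> > > >     # (add the same SSL producer props as in the earlier runs as needed)
> > > >     bin/kafka-producer-perf-test --topic bench --num-records 500000000 \
> > > >       --record-size 100 --throughput 100000 \
> > > >       --producer-props bootstrap.servers=REDACTED acks=-1
> > > >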
> > > > This was pretty hard to verify fully, which is why I've taken so long
> > > > to reply. All in all I think the result here is expected and not a
> > > > blocker for the release, but a good thing to note for upgrades - if
> > > > folks are running at the limit of their network cards (which you never
> > > > want to do anyway, but benchmarking scenarios often uncover those
> > > > limits), they'll see issues due to the increased replication and
> > > > producer traffic under 0.10.0.0.
> > > >
> > > > Apologies for the chase here - this distinctly seemed like a real
> > > > issue, and one I (and I think everybody else) would have wanted to
> > > > block the release on. I'm going to move on to our "failure" testing,
> > > > in which we run the same performance benchmarks whilst hard-killing a
> > > > node. We've seen very good results for that under 0.9, and hopefully
> > > > they'll continue under 0.10.
> > > >
> > > > On Sat, May 14, 2016 at 1:33 AM, Gwen Shapira <g...@confluent.io>
> > wrote:
> > > >
> > > > > also, perhaps sharing the broker configuration? maybe this will
> > > > > provide some hints...
> > > > >
> > > > > On Fri, May 13, 2016 at 5:31 PM, Ismael Juma <ism...@juma.me.uk>
> > > wrote:
> > > > > > Thanks Tom. I just wanted to share that I have been unable to
> > > > > > reproduce this so far. Please feel free to share whatever
> > > > > > information you have so far when you have a chance; don't feel
> > > > > > that you need to have all the answers.
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Fri, May 13, 2016 at 7:32 PM, Tom Crayford <
> > tcrayf...@heroku.com>
> > > > > wrote:
> > > > > >
> > > > > >> I've been investigating this pretty hard since I first noticed
> it.
> > > > Right
> > > > > >> now I have more avenues for investigation than I can shake a
> stick
> > > at,
> > > > > and
> > > > > >> am also dealing with several other things in flight/on fire.
> I'll
> > > > > respond
> > > > > >> when I have more information and can confirm things.
> > > > > >>
> > > > > >> On Fri, May 13, 2016 at 6:30 PM, Becket Qin <
> becket....@gmail.com
> > >
> > > > > wrote:
> > > > > >>
> > > > > >> > Tom,
> > > > > >> >
> > > > > >> > Maybe it was mentioned and I missed it - I am wondering if you
> > > > > >> > see a performance degradation on the consumer side when TLS is
> > > > > >> > used? This could help us understand whether the issue is only
> > > > > >> > producer related or TLS in general.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> >
> > > > > >> > Jiangjie (Becket) Qin
> > > > > >> >
> > > > > >> > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <
> > > tcrayf...@heroku.com
> > > > >
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Ismael,
> > > > > >> > >
> > > > > >> > > Thanks. I'm writing up an issue with some new findings since
> > > > > yesterday
> > > > > >> > > right now.
> > > > > >> > >
> > > > > >> > > Thanks
> > > > > >> > >
> > > > > >> > > Tom
> > > > > >> > >
> > > > > >> > > On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <
> > ism...@juma.me.uk
> > > >
> > > > > >> wrote:
> > > > > >> > >
> > > > > >> > > > Hi Tom,
> > > > > >> > > >
> > > > > >> > > > That's because JIRA is in lockdown due to excessive spam.
> I
> > > have
> > > > > >> added
> > > > > >> > > you
> > > > > >> > > > as a contributor in JIRA and you should be able to file a
> > > ticket
> > > > > now.
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > > Ismael
> > > > > >> > > >
> > > > > >> > > > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <
> > > > > tcrayf...@heroku.com
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Ok, I don't seem to be able to file a new Jira issue at
> > all.
> > > > Can
> > > > > >> > > somebody
> > > > > >> > > > > check my permissions on Jira? My user is
> > `tcrayford-heroku`
> > > > > >> > > > >
> > > > > >> > > > > Tom Crayford
> > > > > >> > > > > Heroku Kafka
> > > > > >> > > > >
> > > > > >> > > > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <
> > j...@confluent.io
> > > >
> > > > > >> wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Tom,
> > > > > >> > > > > >
> > > > > >> > > > > > We don't have a CSV metrics reporter in the producer
> > > > > >> > > > > > right now. The metrics will be available in JMX. You can
> > > > > >> > > > > > find the details in
> > > > > >> > > > > > http://kafka.apache.org/documentation.html#new_producer_monitoring
> > > > > >> > > > > >
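> > > > > >> > > > > > For example (untested sketch - assumes the perf-test JVM
> > > > > >> > > > > > was started with JMX_PORT=9999 and that the producer uses
> > > > > >> > > > > > the default client-id of producer-1), you could poll the
> > > > > >> > > > > > producer-metrics MBean, which includes batch-size-avg,
> > > > > >> > > > > > with the bundled JmxTool:
> > > > > >> > > > > >
> > > > > >> > > > > >     bin/kafka-run-class.sh kafka.tools.JmxTool \
> > > > > >> > > > > >       --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
> > > > > >> > > > > >       --object-name 'kafka.producer:type=producer-metrics,client-id=producer-1' \
> > > > > >> > > > > >       --reporting-interval 5000
> > > > > >> > > > > >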
> > > > > >> > > > > > Thanks,
> > > > > >> > > > > >
> > > > > >> > > > > > Jun
> > > > > >> > > > > >
> > > > > >> > > > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <
> > > > > >> > tcrayf...@heroku.com>
> > > > > >> > > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Yep, I can try those particular commits tomorrow.
> > > > > >> > > > > > > Before I try a bisect, I'm going to replicate this
> > > > > >> > > > > > > with a smaller-scale perf test that's less intensive
> > > > > >> > > > > > > and quicker to iterate on.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Jun, inline:
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Thursday, 12 May 2016, Jun Rao <j...@confluent.io
> >
> > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Tom,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thanks for reporting this. A few quick comments.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 1. Did you send the right command for
> producer-perf?
> > > The
> > > > > >> > command
> > > > > >> > > > > limits
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > throughput to 100 msgs/sec. So, not sure how a
> > single
> > > > > >> producer
> > > > > >> > > can
> > > > > >> > > > > get
> > > > > >> > > > > > > 75K
> > > > > >> > > > > > > > msgs/sec.
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > Ah yep, wrong commands. I'll get the right one
> > > > > >> > > > > > > tomorrow. Sorry, I was interpolating variables into a
> > > > > >> > > > > > > shell script.
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 2. Could you collect some stats (e.g. average
> batch
> > > > size)
> > > > > in
> > > > > >> > the
> > > > > >> > > > > > producer
> > > > > >> > > > > > > > and see if there is any noticeable difference
> > between
> > > > 0.9
> > > > > and
> > > > > >> > > 0.10?
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > That'd just be hooking up the CSV metrics reporter,
> > > > > >> > > > > > > right?
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 3. Is the broker-to-broker communication also on
> > SSL?
> > > > > Could
> > > > > >> you
> > > > > >> > > do
> > > > > >> > > > > > > another
> > > > > >> > > > > > > > test with replication factor 1 and see if you
> still
> > > see
> > > > > the
> > > > > >> > > > > > degradation?
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > Interbroker replication is always SSL in all test runs
> > > > > >> > > > > > > so far. I can try with replication factor 1 tomorrow.
> > > > > >> > > > > > >
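> > > > > >> > > > > > > Roughly, that would be the same benchmark against a
> > > > > >> > > > > > > topic created with replication factor 1, e.g.
> > > > > >> > > > > > > (illustrative names/values):
> > > > > >> > > > > > >
> > > > > >> > > > > > >     bin/kafka-topics.sh --zookeeper REDACTED:2181 --create \
> > > > > >> > > > > > >       --topic bench-rf1 --partitions 8 --replication-factor 1
> > > > > >> > > > > > >
> > > > > >> > > > > > > Inter-broker SSL itself is controlled by
> > > > > >> > > > > > > security.inter.broker.protocol=SSL on the brokers, so
> > > > > >> > > > > > > it stays on either way.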
> > > > > >> > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Finally, email is probably not the best way to
> > discuss
> > > > > >> > > performance
> > > > > >> > > > > > > results.
> > > > > >> > > > > > > > If you have more of them, could you create a jira
> > and
> > > > > attach
> > > > > >> > your
> > > > > >> > > > > > > findings
> > > > > >> > > > > > > > there?
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > Yep. I only wrote the email because JIRA was in
> > lockdown
> > > > > mode
> > > > > >> > and I
> > > > > >> > > > > > > couldn't create new issues.
> > > > > >> > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thanks,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Jun
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <
> > > > > >> > > > tcrayf...@heroku.com
> > > > > >> > > > > > > > <javascript:;>> wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > We've started running our usual suite of
> > performance
> > > > > tests
> > > > > >> > > > against
> > > > > >> > > > > > > Kafka
> > > > > >> > > > > > > > > 0.10.0.0 RC. These tests orchestrate multiple
> > > > > >> > consumer/producer
> > > > > >> > > > > > > machines
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > > run a fairly normal mixed workload of producers
> > and
> > > > > >> consumers
> > > > > >> > > > > > > > > (each producer/consumer is just an instance of
> > > > > >> > > > > > > > > Kafka's inbuilt consumer/producer perf tests).
> > > > > >> > > > > > > > > We've found about a 33% performance drop in the
> > > > > >> > > > > > > > > producer if TLS is used (compared to 0.9.0.1).
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > We've seen notable producer performance
> > > > > >> > > > > > > > > degradations between 0.9.0.1 and 0.10.0.0 RC.
> > > > > >> > > > > > > > > We're running as of commit 9404680 right now.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Our specific test case runs Kafka on 8 EC2
> > machines,
> > > > > with
> > > > > >> > > > enhanced
> > > > > >> > > > > > > > > networking. Nothing is changed between the
> > > instances,
> > > > > and
> > > > > >> > I've
> > > > > >> > > > > > > reproduced
> > > > > >> > > > > > > > > this over 4 different sets of clusters now.
> We're
> > > > seeing
> > > > > >> > about
> > > > > >> > > a
> > > > > >> > > > > 33%
> > > > > >> > > > > > > > > performance drop between 0.9.0.1 and 0.10.0.0 as
> > of
> > > > > commit
> > > > > >> > > > 9404680.
> > > > > >> > > > > > > > > Please note that this doesn't match up with
> > > > > >> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565,
> > > > > >> > > > > > > > > because our performance tests are run with
> > > > > >> > > > > > > > > compression off, and this seems to be a TLS-only
> > > > > >> > > > > > > > > issue.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Under 0.10.0-rc4, we see an 8-node cluster with
> > > > > >> > > > > > > > > replication factor 3 and 13 producers max out at
> > > > > >> > > > > > > > > around 1 million 100-byte messages a second.
> > > > > >> > > > > > > > > Under 0.9.0.1, the same cluster does 1.5 million
> > > > > messages a
> > > > > >> > > > second.
> > > > > >> > > > > > > Both
> > > > > >> > > > > > > > > tests were with TLS on. I've reproduced this on
> > > > multiple
> > > > > >> > > clusters
> > > > > >> > > > > now
> > > > > >> > > > > > > (5
> > > > > >> > > > > > > > or
> > > > > >> > > > > > > > > so of each version) to account for the inherent
> > > > > performance
> > > > > >> > > > > variance
> > > > > >> > > > > > of
> > > > > >> > > > > > > > > EC2. There's no notable performance difference
> > > > > >> > > > > > > > > without TLS on these runs - it appears to be
> > > > > >> > > > > > > > > entirely a TLS regression.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > A single producer with TLS under 0.10 does about
> > > > > >> > > > > > > > > 75k messages/s. Under 0.9.0.1 it does around 120k
> > > > > >> > > > > > > > > messages/s.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > The exact producer-perf line we're using is
> this:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > bin/kafka-producer-perf-test --topic "bench"
> > > > > >> > > > > > > > > --num-records "500000000" --record-size "100"
> > > > > >> > > > > > > > > --throughput "100" --producer-props acks="-1"
> > > > > >> > > > > > > > > bootstrap.servers=REDACTED
> > > > > >> > > > > > > > > ssl.keystore.location=client.jks
> > > > > >> > > > > > > > > ssl.keystore.password=REDACTED
> > > > > >> > > > > > > > > ssl.truststore.location=server.jks
> > > > > >> > > > > > > > > ssl.truststore.password=REDACTED
> > > > > >> > > > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> > > > > >> > > > > > > > > security.protocol=SSL
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > We're using the same setup, machine type etc for
> > > each
> > > > > test
> > > > > >> > run.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > We've tried using both 0.9.0.1 producers and
> > > 0.10.0.0
> > > > > >> > producers
> > > > > >> > > > and
> > > > > >> > > > > > the
> > > > > >> > > > > > > > TLS
> > > > > >> > > > > > > > > performance impact was there for both.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > I've glanced over the code between 0.9.0.1 and
> > > > 0.10.0.0
> > > > > and
> > > > > >> > > > haven't
> > > > > >> > > > > > > seen
> > > > > >> > > > > > > > > anything that seemed to have this kind of
> impact -
> > > > > indeed
> > > > > >> the
> > > > > >> > > TLS
> > > > > >> > > > > > code
> > > > > >> > > > > > > > > doesn't seem to have changed much between
> 0.9.0.1
> > > and
> > > > > >> > 0.10.0.0.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Any thoughts? Should I file an issue and see
> about
> > > > > >> > reproducing
> > > > > >> > > a
> > > > > >> > > > > more
> > > > > >> > > > > > > > > minimal test case?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > I don't think this is related to
> > > > > >> > > > > > > > >
> https://issues.apache.org/jira/browse/KAFKA-3565
> > -
> > > > > that is
> > > > > >> > for
> > > > > >> > > > > > > > compression
> > > > > >> > > > > > > > > on and plaintext, and this is for TLS only.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
>
