We fixed two serious Snappy bugs for 0.8.2.2, so you may want to check that.

On Thu, Jan 21, 2016 at 8:16 AM, Cliff Rhyne <crh...@signal.co> wrote:

> Hi Leo,
>
> I'm not sure if this is the issue you're encountering, but this is what we
> found when we went from 0.8.1.1 to 0.8.2.1.
>
> Snappy compression didn't work as expected.  Something in the library broke
> batch compression, so each message was compressed individually (which for us
> caused a lot of overhead).  Disk usage went way up and CPU usage went up
> incrementally (still under 1%).  I didn't monitor latency closely; it stayed
> well within the tolerances of our system.  We resolved the issue by switching
> our compression to gzip.
>
> This issue is supposedly fixed in 0.9.0.0 but we haven't verified it yet.
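>
> In case it helps, here is a minimal sketch of what the gzip switch can look
> like on the producer side. It assumes the new Java producer that ships with
> 0.8.2; the broker list and topic name are placeholders, and the old Scala
> producer takes "compression.codec" instead of "compression.type".
>
>     import java.util.Properties;
>     import org.apache.kafka.clients.producer.KafkaProducer;
>     import org.apache.kafka.clients.producer.ProducerRecord;
>
>     public class GzipProducerSketch {
>         public static void main(String[] args) {
>             Properties props = new Properties();
>             props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder brokers
>             props.put("key.serializer",
>                 "org.apache.kafka.common.serialization.StringSerializer");
>             props.put("value.serializer",
>                 "org.apache.kafka.common.serialization.StringSerializer");
>             // gzip compresses the whole record batch, so the per-message
>             // compression overhead described above should not apply
>             props.put("compression.type", "gzip");
>
>             KafkaProducer<String, String> producer = new KafkaProducer<>(props);
>             producer.send(new ProducerRecord<>("example-topic", "hello")); // placeholder topic
>             producer.close();
>         }
>     }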
>
> Cliff
>
> On Thu, Jan 21, 2016 at 4:04 AM, Clelio De Souza <cleli...@gmail.com>
> wrote:
>
> > Hi all,
> >
> >
> > We are using Kafka in production and we have been facing some performance
> > degradation of the cluster, apparently once the cluster has been running
> > for a while and is a bit "old".
> >
> >
> > Our production cluster has been up and running since 31/12/2015, and we run
> > performance tests on our application measuring the full round trip of TCP
> > packets and Kafka production/consumption of data (3 hops in total for every
> > single TCP packet being sent, persisted and consumed at the other end). The
> > results for the production cluster show a latency of ~130ms to 200ms.
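> >
> > (A simplified sketch of this kind of round-trip probe is shown below, for
> > illustration only; the broker address and topic name are placeholders. The
> > producing side embeds a send timestamp in the payload, and the consuming
> > side at the end of the 3 hops subtracts it from its own clock.)
> >
> >     import java.util.Properties;
> >     import org.apache.kafka.clients.producer.KafkaProducer;
> >     import org.apache.kafka.clients.producer.ProducerRecord;
> >
> >     public class LatencyProbe {
> >         public static void main(String[] args) throws Exception {
> >             Properties props = new Properties();
> >             props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
> >             props.put("key.serializer",
> >                 "org.apache.kafka.common.serialization.StringSerializer");
> >             props.put("value.serializer",
> >                 "org.apache.kafka.common.serialization.StringSerializer");
> >
> >             KafkaProducer<String, String> producer = new KafkaProducer<>(props);
> >             // Embed the send time in the payload; the consumer at the far end
> >             // computes System.currentTimeMillis() - sentMillis as the latency
> >             // (this assumes reasonably synchronized clocks between hosts).
> >             String payload = Long.toString(System.currentTimeMillis());
> >             producer.send(new ProducerRecord<>("latency-probe", payload)).get(); // wait for the ack
> >             producer.close();
> >         }
> >     }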
> >
> >
> > In our Test environment we have the very same software and specification on
> > AWS instances, i.e. the Test environment is a mirror of Prod. The Kafka
> > cluster has been running in Test since 18/12/2015, and the same performance
> > tests (as described above) show an increase in latency to ~800ms to 1000ms.
> >
> >
> > We have just recently set up a fresh new Kafka cluster (on 18/01/2016) to
> > try to get to the bottom of this performance degradation problem. In that
> > new Kafka cluster, deployed in Test as a replacement for the original Test
> > Kafka cluster, we found a very small latency of ~10ms to 15ms.
> >
> >
> > We are using Kafka version 0.8.2.1 for all the environments mentioned above,
> > and the same cluster configuration has been set up on all of them: 3 brokers
> > on m3.xlarge AWS instances. The amount of data and the Kafka topics are
> > roughly the same across those environments, so the performance degradation
> > does not seem to be directly related to the amount of data in the cluster.
> > We suspect something running inside the Kafka cluster, such as repartitioning
> > or log retention, may be the cause (even though our topics are configured to
> > retain data for ~2 years, and nowhere near that much time has elapsed).
> >
> >
> > The Kafka broker config can be found below. If anyone could shed some light
> > on what may be causing the performance degradation of our Kafka cluster, it
> > would be great and very much appreciated.
> >
> >
> > Thanks,
> > Leo
> >
> > --------------------
> >
> >
> > # Licensed to the Apache Software Foundation (ASF) under one or more
> > # contributor license agreements.  See the NOTICE file distributed with
> > # this work for additional information regarding copyright ownership.
> > # The ASF licenses this file to You under the Apache License, Version 2.0
> > # (the "License"); you may not use this file except in compliance with
> > # the License.  You may obtain a copy of the License at
> > #
> > #    http://www.apache.org/licenses/LICENSE-2.0
> > #
> > # Unless required by applicable law or agreed to in writing, software
> > # distributed under the License is distributed on an "AS IS" BASIS,
> > # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > # See the License for the specific language governing permissions and
> > # limitations under the License.
> > # see kafka.server.KafkaConfig for additional details and defaults
> >
> > ############################# Server Basics #############################
> >
> > # The id of the broker. This must be set to a unique integer for each broker.
> > broker.id=<broker_num>
> >
> > ############################# Socket Server Settings #############################
> >
> > # The port the socket server listens on
> > port=9092
> >
> > # Hostname the broker will bind to. If not set, the server will bind to all interfaces
> > #host.name=localhost
> >
> > # Hostname the broker will advertise to producers and consumers. If not set, it uses the
> > # value for "host.name" if configured.  Otherwise, it will use the value returned from
> > # java.net.InetAddress.getCanonicalHostName().
> > #advertised.host.name=<hostname routable by clients>
> >
> > # The port to publish to ZooKeeper for clients to use. If this is not set,
> > # it will publish the same port that the broker binds to.
> > #advertised.port=<port accessible by clients>
> >
> > # The number of threads handling network requests
> > num.network.threads=3
> >
> > # The number of threads doing disk I/O
> > num.io.threads=8
> >
> > # The send buffer (SO_SNDBUF) used by the socket server
> > socket.send.buffer.bytes=102400
> >
> > # The receive buffer (SO_RCVBUF) used by the socket server
> > socket.receive.buffer.bytes=102400
> >
> > # The maximum size of a request that the socket server will accept (protection against OOM)
> > socket.request.max.bytes=104857600
> >
> > ############################# Log Basics #############################
> >
> > # A comma separated list of directories under which to store log files
> > log.dirs=/data/kafka/logs
> >
> > # The default number of log partitions per topic. More partitions allow greater
> > # parallelism for consumption, but this will also result in more files across
> > # the brokers.
> > num.partitions=8
> >
> > # The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
> > # This value is recommended to be increased for installations with data dirs located in RAID array.
> > num.recovery.threads.per.data.dir=1
> >
> > ############################# Log Flush Policy #############################
> >
> > # Messages are immediately written to the filesystem but by default we only fsync() to sync
> > # the OS cache lazily. The following configurations control the flush of data to disk.
> > # There are a few important trade-offs here:
> > #    1. Durability: Unflushed data may be lost if you are not using replication.
> > #    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
> > #    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
> > # The settings below allow one to configure the flush policy to flush data after a period of time or
> > # every N messages (or both). This can be done globally and overridden on a per-topic basis.
> >
> > # The number of messages to accept before forcing a flush of data to disk
> > #log.flush.interval.messages=10000
> >
> > # The maximum amount of time a message can sit in a log before we force a flush
> > #log.flush.interval.ms=1000
> >
> > ############################# Log Retention Policy #############################
> >
> > # The following configurations control the disposal of log segments. The policy can
> > # be set to delete segments after a period of time, or after a given size has accumulated.
> > # A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
> > # from the end of the log.
> >
> > # The minimum age of a log file to be eligible for deletion
> > # Failsafe is we don't lose any messages for 20+ years, topics should
> > # be configured individually
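> > # (200000 hours / 8760 hours per year ≈ 22.8 years)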
> > log.retention.hours=200000
> >
> > # A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
> > # segments don't drop below log.retention.bytes.
> > #log.retention.bytes=1073741824
> >
> > # The maximum size of a log segment file. When this size is reached a new log segment will be created.
> > log.segment.bytes=1073741824
> >
> > # The interval at which log segments are checked to see if they can be deleted according
> > # to the retention policies
> > log.retention.check.interval.ms=300000
> >
> > # By default the log cleaner is disabled and the log retention policy will default to just delete segments after their retention expires.
> > # If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
> > log.cleaner.enable=false
> >
> > default.replication.factor=3
> >
> > auto.create.topics.enable=true
> >
> > controlled.shutdown.enable=true
> >
> > delete.topic.enable=true
> >
> > ############################# Zookeeper #############################
> >
> > # Zookeeper connection string (see zookeeper docs for details).
> > # This is a comma separated list of host:port pairs, each corresponding to a zk
> > # server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
> > # You can also append an optional chroot string to the urls to specify the
> > # root directory for all kafka znodes.
> > zookeeper.connect=<zk1-address>:2181,<zk2-address>:2181,<zk3-address>:2181
> >
> > # Timeout in ms for connecting to zookeeper
> > zookeeper.connection.timeout.ms=6000
> >
>
>
>
> --
> Cliff Rhyne
> Software Engineering Lead
> e: crh...@signal.co
> signal.co
>
