[ https://issues.apache.org/jira/browse/KAFKA-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
li xiangyuan updated KAFKA-9211:
--------------------------------
    Attachment: nodelay.txt

> kafka upgrade 2.3.0 causes produce speed decrease
> -------------------------------------------------
>
>                 Key: KAFKA-9211
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9211
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, producer
>    Affects Versions: 2.3.0
>            Reporter: li xiangyuan
>            Priority: Critical
>         Attachments: ackdelay.txt, broker-jstack.txt, nodelay.txt, producer-jstack.txt, producer.node.latency.png
>
> Recently we tried to upgrade Kafka from 0.10.0.1 to 2.3.0.
> We have 15 clusters in our production environment, each with 3~6 brokers.
> The upgrade procedure is:
> 1. Replace the code with the 2.3.0 jar and restart all brokers one by one.
> 2. Unset inter.broker.protocol.version=0.10.0.1 and restart all brokers one by one.
> 3. Unset log.message.format.version=0.10.0.1 and restart all brokers one by one.
>
> So far we have completed steps 1 & 2 on 12 clusters. But when we tried to upgrade the remaining clusters (step 1 already done) with step 2, produce throughput on some topics dropped badly.
> We have researched this issue for a long time; since we cannot experiment in the production environment and cannot reproduce the problem in the test environment, we have not found the root cause.
> All we can do is describe the situation in as much detail as we know; we hope someone can help us.
>
> 1. Because of bug KAFKA-8653, I added the code below to the handleJoinGroupRequest function in KafkaApis.scala:
> {code:java}
> // Fall back to the session timeout when the client sends no rebalance timeout
> if (rebalanceTimeoutMs <= 0) {
>   rebalanceTimeoutMs = joinGroupRequest.data.sessionTimeoutMs
> }{code}
> 2. One cluster that failed to upgrade has 6 brokers (8 cores / 16 GB each) and about 200 topics with 2 replicas; every broker holds 3000+ partitions and 1500+ leader partitions. Most topics have a very low produce rate, below about 50 messages/sec. Only one topic, with 300 partitions, receives more than 2500 messages/sec and is consumed by more than 20 consumer groups.
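> For reference, the two broker settings that the staged upgrade above revolves around can be sketched as a server.properties fragment (a minimal sketch; the setting names and the 0.10.0.1 value are taken from this report, the comments describe the reported procedure):
> {code}
> # During step 1: run the 2.3.0 jars with both settings pinned to the old version
> inter.broker.protocol.version=0.10.0.1
> log.message.format.version=0.10.0.1
> # Step 2: remove (unset) inter.broker.protocol.version, then rolling-restart all brokers
> # Step 3: remove (unset) log.message.format.version, then rolling-restart all brokers
> {code}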
> The whole cluster therefore produces about 4K messages/sec, 11 MB/sec in and 240 MB/sec out, and more than 90% of the traffic comes from that 2500 messages/sec topic. When we unset inter.broker.protocol.version=0.10.0.1 on 5 or 6 servers and restarted, the produce rate on this topic dropped to about 200 messages/sec. I don't know whether this usage pattern could trigger the problem.
> 3. We use Kafka wrapped by spring-kafka with the KafkaTemplate's autoFlush=true, so each producer.send execution also executes producer.flush immediately. I know the flush method decreases produce performance dramatically, but at least nothing seemed wrong before upgrade step 2; I now doubt whether it has become a problem after the upgrade.
> 4. I noticed that when the produce rate drops, consumer groups with a large message lag still consume without any change or decrease in consume rate, so I guess only ProduceRequest throughput drops, not FetchRequest throughput.
> 5. We have not set any throttle configuration, and all producers use acks=1 (so it is not slow broker replica fetching). When the problem is triggered, CPU usage on both servers and producers goes down, and the servers' I/O util stays below 30%, so it should not be a hardware problem.
> 6. This is triggered almost 100% of the time once most brokers have completed upgrade step 2 and an automatic preferred leader election has run; then we observe the produce rate drop. We have to downgrade the brokers (set inter.broker.protocol.version=0.10.0.1) and restart them one by one before throughput returns to normal. Some clusters required downgrading all brokers, but on some clusters 1 or 2 brokers could be left un-downgraded; I noticed that the broker that does not need the downgrade is the controller.
> 7. I have captured jstack dumps for the producers and servers. Although I did this on a different cluster, we can see that their threads appear to be largely idle.
> 8. Both the 0.10.0.1 and 2.3.0 kafka-client versions trigger this problem.
> 9. While the largest topic always drops its produce rate, other topics drop randomly: topicA may drop in the first upgrade attempt but not the next, and topicB may not drop in the first attempt but drop in a later one.
> 10. In fact, the largest cluster, which has the same topic & group usage pattern described above but whose largest topic receives 12,000 messages/sec, already fails to upgrade at step 1 (just using the 2.3.0 jar).
>
> Any help would be greatly appreciated, thanks. I'm very sad now...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)