[ 
https://issues.apache.org/jira/browse/KAFKA-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

li xiangyuan updated KAFKA-9211:
--------------------------------
    Attachment: producer.node.latency.png

> Kafka upgrade to 2.3.0 causes produce speed to decrease
> -------------------------------------------------------
>
>                 Key: KAFKA-9211
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9211
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, producer 
>    Affects Versions: 2.3.0
>            Reporter: li xiangyuan
>            Priority: Critical
>         Attachments: broker-jstack.txt, producer-jstack.txt, 
> producer.node.latency.png
>
>
> Recently we tried to upgrade Kafka from 0.10.0.1 to 2.3.0.
> We have 15 clusters in the production environment, each with 3~6 brokers.
> As we understand it, the rolling upgrade should go like this (see the
> properties sketch after these steps):
>       1. replace the code with the 2.3.0 jars and restart all brokers one by one
>       2. unset inter.broker.protocol.version=0.10.0.1 and restart all brokers
> one by one
>       3. unset log.message.format.version=0.10.0.1 and restart all brokers one
> by one
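> For reference, a sketch of the server.properties toggles involved in those
> steps (this is my understanding of the procedure, not a copy of our exact
> configs):
> {code}
> # step 1: run the 2.3.0 code but keep the old protocol and message format pinned
> inter.broker.protocol.version=0.10.0.1
> log.message.format.version=0.10.0.1
> # step 2: remove the inter.broker.protocol.version pin (it then defaults to
> # the 2.3 protocol) and rolling-restart all brokers
> # step 3: remove the log.message.format.version pin and rolling-restart again
> {code}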
>  
> So far we have completed steps 1 & 2 on 12 clusters, but when we tried to
> run step 2 on the remaining clusters (which had already done step 1), the
> produce rate of some topics dropped badly.
> We have been researching this issue for a long time; since we cannot test in
> the production environment and cannot reproduce it in the test environment,
> we have not found the root cause.
> For now I can only describe the situation in as much detail as I know; I
> hope someone can help us.
>  
> 1. Because of bug KAFKA-8653 (JoinGroup v0 requests from old clients carry
> no rebalance timeout field), I added the code below to the
> handleJoinGroupRequest function in KafkaApis.scala:
> {code:java}
> // Workaround for KAFKA-8653: JoinGroup v0 requests carry no rebalance
> // timeout, so fall back to the session timeout instead of leaving it at 0.
> if (rebalanceTimeoutMs <= 0) {
>   rebalanceTimeoutMs = joinGroupRequest.data.sessionTimeoutMs
> }{code}
> 2. One cluster that failed to upgrade has 6 brokers (8 cores / 16 GB each)
> and about 200 topics with 2 replicas; every broker hosts 3000+ partitions
> and 1500+ leader partitions. Most topics have a very low produce rate
> (fewer than 50 messages/sec); only one topic, with 300 partitions, receives
> more than 2500 messages/sec, and more than 20 consumer groups consume from
> it.
> So the whole cluster produces ~4K messages/sec, with 11 MB/sec in and 240
> MB/sec out (presumably the ~22x out/in ratio reflects those 20+ consumer
> groups plus follower replication fetches), and more than 90% of the traffic
> comes from that one 2500 messages/sec topic.
> When we unset inter.broker.protocol.version=0.10.0.1 on 5 or 6 of the
> servers and restarted, this topic's produce rate dropped to about 200
> messages/sec. I don't know whether the way we use Kafka could trigger any
> problem.
> 3. We use Kafka wrapped by spring-kafka, with the KafkaTemplate's
> autoFlush=true, so every producer.send() is immediately followed by a
> producer.flush(). I know flush() decreases produce performance dramatically,
> but at least nothing seemed wrong before upgrade step 2; I now wonder
> whether it has become a problem after the upgrade (a sketch of this usage
> pattern follows below).
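> To make that pattern concrete, here is a minimal sketch of what
> autoFlush=true amounts to (the topic name, bootstrap address and
> serializers are made up for illustration):
> {code:java}
> import java.util.Map;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.common.serialization.StringSerializer;
> import org.springframework.kafka.core.DefaultKafkaProducerFactory;
> import org.springframework.kafka.core.KafkaTemplate;
>
> public class AutoFlushSketch {
>   public static void main(String[] args) {
>     Map<String, Object> props = Map.of(
>         ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092",
>         ProducerConfig.ACKS_CONFIG, "1", // same acks=1 we use in production
>         ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class,
>         ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
>     // autoFlush=true makes KafkaTemplate call producer.flush() after every
>     // send, so each send blocks on a broker round trip and batching is lost
>     KafkaTemplate<String, String> template =
>         new KafkaTemplate<>(new DefaultKafkaProducerFactory<>(props), true);
>     template.send("some-topic", "payload");
>   }
> }{code}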
> 4. I noticed that while the produce rate is degraded, consumer groups with
> a large message lag keep consuming with no change or decrease in consume
> rate, so I guess only ProduceRequest throughput drops, not FetchRequest.
> 5. We haven't set any throttle configuration, and all producers use acks=1
> (so it's not the broker's replica fetching being slow). When the problem
> triggers, CPU usage drops on both the servers and the producers, and the
> servers' I/O util stays below 30%, so it shouldn't be a hardware problem.
> 6. This is triggered almost 100% of the time once most brokers have done
> upgrade step 2: after an automatic preferred leader election runs, we can
> observe the produce rate drop. We then have to downgrade the brokers (set
> inter.broker.protocol.version=0.10.0.1) and restart them one by one before
> things return to normal. Some clusters require downgrading all brokers, but
> on some we can leave 1 or 2 brokers not downgraded; I noticed that the
> broker that does not need the downgrade is the controller (the settings
> involved are sketched below).
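> Since the drop reliably follows the automatic preferred leader election,
> these are the broker settings involved (defaults shown; a sketch, not our
> exact configuration):
> {code}
> # automatic preferred leader election, on by default; this is the
> # "auto leader election" mentioned above
> auto.leader.rebalance.enable=true
> leader.imbalance.check.interval.seconds=300
> # the downgrade pin that restores throughput for us
> inter.broker.protocol.version=0.10.0.1
> {code}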
> 7. I have captured jstack output for a producer and for the servers.
> Although these were not taken on the same cluster, you can see that the
> threads appear to be mostly idle.
> 8. Both the 0.10.0.1 and 2.3.0 kafka-client versions trigger this problem.
> 9. The largest topic always drops its produce rate, while other topics drop
> randomly: topicA may drop in the first upgrade attempt but not in the next,
> and topicB may not drop in the first attempt but drop in another attempt.
> 10. In fact, the largest cluster, which has the same topic and consumer
> group usage pattern described above but whose largest topic receives 12,000
> messages/sec, already fails in upgrade step 1 (just deploying the 2.3.0
> jars).
> Any help would be greatly appreciated. Thanks; I'm very sad now...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
