[jira] [Commented] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

Joe Ammann (JIRA) Mon, 25 Mar 2019 16:02:40 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801214#comment-16801214
 ]


Joe Ammann commented on KAFKA-8151:
-----------------------------------

I still can't reproduce any symptoms in DEV when using PLAINTEXT for 
interbroker comms.

But last night we had 2 occurrences of symptom 2 (all brokers and the controller 
registered in ZK, but controller actions - e.g. partition leader reassignment - 
do not happen) in TEST, where I had also enabled PLAINTEXT.

So it definitely also happens with PLAINTEXT.

> Broker hangs and lockups after Zookeeper outages
> ------------------------------------------------
>
>                 Key: KAFKA-8151
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8151
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core, zkclient
>    Affects Versions: 2.1.1
>            Reporter: Joe Ammann
>            Priority: Major
>         Attachments: symptom3_lxgurten_kafka_dump1.txt, 
> symptom3_lxgurten_kafka_dump2.txt, symptom3_lxgurten_kafka_dump3.txt
>
>
> We're running several clusters (mostly with 3 brokers) with 2.1.1, where we 
> see at least 3 different symptoms, all resulting in broker/controller lockups.
> We are pretty sure that the triggering cause for all these symptoms is 
> temporary outages (normally 3-5 minutes) of the Zookeeper cluster. The Linux 
> VMs that the ZK nodes run on regularly get stalled for a couple of minutes. The 
> ZK nodes always very quickly reunite and build a quorum after the situation 
> clears, but the Kafka brokers (which run on the same Linux VMs) quite often 
> show problems after this procedure.
> I've seen 3 different kinds of problems (this is why I put "reproduce" in 
> quotes, I can never predict what will happen)
>  # the brokers get their ZK sessions expired (obviously) and sometimes only 2 
> of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for 
> some reason (that's the problem I originally described)
>  # the brokers all re-register and re-elect a new controller. But that new 
> controller does not fully work. For example, it doesn't process partition 
> reassignment requests and/or does not transfer partition leadership after I 
> kill a broker
>  # the previous controller gets "dead-locked" (it has 3-4 of the important 
> controller threads in a lock) and hence does not perform any of its 
> controller duties. But it still regards itself as the valid controller and 
> is accepted by the other brokers
> I'll try to describe each one of the problems in more detail below, and hope 
> to be able to clearly separate them.
> I'm able to provoke these problems in our DEV environment quite regularly 
> using the following procedure
> * make sure all ZK nodes and Kafka brokers are stable and reacting normally
> * freeze 2 out of 3 ZK nodes with {{kill -STOP}} for some minutes
> * leave the Kafka brokers running; of course they will start complaining that 
> they are unable to reach ZK
> * thaw the processes with {{kill -CONT}}
> * now all Kafka brokers get notified that their ZK session has expired, and 
> they start to reorganize the cluster
> In about 20% of the tests, I'm able to produce one of the symptoms above. I 
> cannot predict which one, though. I sometimes vary this procedure by also 
> freezing one Kafka broker (most often the controller), but until now I 
> haven't been able to create a clear pattern or really force one specific 
> symptom.
>  
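
The freeze/thaw steps in the quoted procedure could be scripted roughly as 
follows. This is only a sketch, not the script actually used for the tests: 
the function name freeze_and_thaw and its parameters are made up for 
illustration, and it assumes the ZK processes to freeze run on the same host 
as the script, with their PIDs already known.

```python
import os
import signal
import time


def freeze_and_thaw(pids, freeze_seconds, sleep=time.sleep, send_signal=os.kill):
    """Pause processes with SIGSTOP, wait, then resume them with SIGCONT.

    Hypothetical helper mirroring `kill -STOP` / `kill -CONT` from the
    procedure. `sleep` and `send_signal` are injectable so the ordering
    can be checked without touching real processes.
    """
    pids = list(pids)
    # Freeze every target process (e.g. 2 out of 3 ZK nodes).
    for pid in pids:
        send_signal(pid, signal.SIGSTOP)
    try:
        # Keep them frozen while the brokers lose their ZK connection.
        sleep(freeze_seconds)
    finally:
        # Always thaw, even if the wait is interrupted.
        for pid in pids:
            send_signal(pid, signal.SIGCONT)
```

A call such as freeze_and_thaw([zk_pid_1, zk_pid_2], 180) would freeze two ZK 
processes for three minutes; the PID variables and the 180 seconds are 
placeholders, chosen only to match the "some minutes" in the procedure.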



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
