Tom Crayford created KAFKA-3976:
-----------------------------------

             Summary: Kafka Controller gets stuck, potentially due to zookeeper 
session id reuse
                 Key: KAFKA-3976
                 URL: https://issues.apache.org/jira/browse/KAFKA-3976
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.10.0.0, 0.9.0.1
            Reporter: Tom Crayford


We've been seeing an issue with a kafka controller getting "stuck" under 
0.9.0.1. We carefully monitor {{ActiveControllerCount}} on all brokers, and 
upon finding it to be 0 for long enough, page an operator. In this particular 
case (and we've seen similar a few times before), it appears as though there 
were some weird conditions leading to the cluster not re-electing a controller.

This could also potentially be a zookeeper client bug. We've spent a while 
looking through the code, and it seems like a session gets re-established, but 
the zookeeper session state callbacks don't seem to get fired on all the 
listeners properly, which leads to the controller being stuck.

Note that with the cluster in this bad state, we've got a znode at /controller 
and /brokers/ids/0 on this bad node:

{code}
get /controller
{"version":1,"brokerid":0,"timestamp":"1468870986582"}
cZxid = 0x20000a646
ctime = Mon Jul 18 19:43:06 UTC 2016
mZxid = 0x20000a646
mtime = Mon Jul 18 19:43:06 UTC 2016
pZxid = 0x20000a646
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x155e49f2b230004
dataLength = 54
numChildren = 0
get /brokers/ids/0
{"jmx_port":-1,"timestamp":"1468870990738","endpoints":["SSL://REDACTED"],"host":null,"version":2,"port":-1}
cZxid = 0x20000a64e
ctime = Mon Jul 18 19:43:10 UTC 2016
mZxid = 0x20000a64e
mtime = Mon Jul 18 19:43:10 UTC 2016
pZxid = 0x20000a64e
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x155e49f2b230004
dataLength = 144
numChildren = 0
{code}

However, that broker definitely doesn't have any {{ActiveControllerCount}} 
(this is a response out of jolokia):

{code}
{"request":{"mbean":"kafka.controller:name=ActiveControllerCount,type=KafkaController","type":"read"},"value":{"Value":0},"timestamp":1468936463,"status":200}
{code}

If you look at the attached log files, you can see that *both* brokers went 
through a brief period where they thought they were the controller, then both 
resigned and the callbacks meant to fire upon establishment of new sessions did 
not fire properly, which leads to a "stuck" controller.

It's perhaps worth noting that these brokers have *no* more log output at all 
since this point. They are relatively unused though, so that is not that 
surprising.

Logs are here (this is a two broker cluster): 
https://gist.github.com/tcrayford/cbf3d5aa1c46194eb7a98786d83b0eab
Sensitive data have been redacted, replaced with REDACTED_$TYPE
The most interesting stuff happens right at the bottom of each log file, where 
it appears a zookeeper session is timed out, then re-established. From my 
understanding of the ZAB protocol, Kafka shouldn't reuse session ids here, but 
it seems to be. That's potentially the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to