Jason Rosenberg created KAFKA-2125:
--------------------------------------

             Summary: Infinite loop after controlled shutdown succeeds
                 Key: KAFKA-2125
                 URL: https://issues.apache.org/jira/browse/KAFKA-2125
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.8.2.1
            Reporter: Jason Rosenberg
            Priority: Blocker


I have a 4 node cluster, running 0.8.2.1, that got into a bad state last night 
during a rolling restart.  The first node to be restarted was the controller.  
Controlled Shutdown completed successfully, after about 800ms.  But after that, 
the server failed to shutdown, and got into what appears to be an infinite 
cycle, involving the 'Controller-45-to-broker-45-send-thread', and the 
'kafka-scheduler-0' thread.  Ultimately, the shutdown timed out (after 3 
minutes) and it was terminated by the deployment system, and the server was 
restarted.  As expected, when it came back up it took some time re-syncing its 
partitions, but eventually it came back and all was well.  However, I think 
there was something fundamentally wrong with the shutdown process.  For one 
thing, the controller didn't seem to give up its controller status as part of 
the controlled shutdown (which I would expect to be the first thing it does).
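
To illustrate the ordering I'd expect (purely a hypothetical sketch with 
made-up names, not the actual KafkaServer internals): resign the controller 
role right after controlled shutdown completes, before any of the broker's 
channels are torn down.
{code}
// Hypothetical sketch of the expected shutdown ordering; method names are
// made up for illustration and are not the real KafkaServer code.
object ExpectedShutdownOrder {
  def shutdown(controlledShutdown: () => Unit,
               resignController: () => Unit,
               closeChannels: () => Unit): Unit = {
    controlledShutdown()  // 1. move partition leadership off this broker
    resignController()    // 2. hand the controller role to another broker
    closeChannels()       // 3. only then tear down sockets / request handlers
  }
}
{code}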

Anyway, below are some log snippets. I do have full logs from each broker in 
the cluster, which I can provide (but would have to significantly anonymize the 
logs before forwarding along).

Controlled shutdown starts and succeeds:
{code}
2015-04-14 05:56:10,134  INFO [Thread-38] server.KafkaServer - [Kafka Server 
45], shutting down
2015-04-14 05:56:10,136  INFO [Thread-38] server.KafkaServer - [Kafka Server 
45], Starting controlled shutdown
2015-04-14 05:56:10,150  INFO [kafka-request-handler-0] 
controller.KafkaController - [Controller 45]: Shutting down broker 45
...
...
2015-04-14 05:56:10,951  INFO [Thread-38] server.KafkaServer - [Kafka Server 
45], Controlled shutdown succeeded
{code}

Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly 
spamming the logs, like so:
{code}
2015-04-14 05:56:11,281  WARN [Controller-45-to-broker-45-send-thread] 
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
Controller 45 epoch 21 fails to send request 
Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
 -> 
(LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
 to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
java.nio.channels.ClosedChannelException
        at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
        at 
kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
        at 
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-04-14 05:56:11,281  INFO [Controller-45-to-broker-45-send-thread] 
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
Controller 45 connected to id:45,host:broker45.square,port:12345 for sending 
state change requests
2015-04-14 05:56:11,582  WARN [Controller-45-to-broker-45-send-thread] 
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
Controller 45 epoch 21 fails to send request 
Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
 -> 
(LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
 to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
java.nio.channels.ClosedChannelException
        at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
        at 
kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
        at 
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-04-14 05:56:11,582  INFO [Controller-45-to-broker-45-send-thread] 
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
Controller 45 connected to id:45,host:broker45.square,port:12345 for sending 
state change requests
{code}
This seems to repeat every 300 ms or so, and continues until the server is 
forcibly shut down (about 3 minutes later, in this case).  The partition being 
logged (anonymized as 'mytopic,1') is always the same topic/partition in these 
repeated messages, so perhaps that one partition got into a bad state.
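
For context, the stack trace suggests a retry loop roughly like the following 
(a minimal, hypothetical sketch, not the actual 
RequestSendThread/ControllerChannelManager code; the 300ms backoff is my guess 
from the log timestamps).  Nothing ever signals the loop to stop, so it keeps 
failing, reconnecting, and retrying the same LeaderAndIsrRequest:
{code}
// Hypothetical sketch only; names and the 300 ms backoff are assumptions
// inferred from the logs, not the real Kafka implementation.
import java.nio.channels.ClosedChannelException
import java.util.concurrent.atomic.AtomicBoolean

class SendLoopSketch(sendRequest: () => Unit, reconnect: () => Unit) extends Thread {
  private val running = new AtomicBoolean(true)

  override def run(): Unit = {
    while (running.get()) {      // nothing flips this flag during broker shutdown,
      try {                      // so the loop never terminates on its own
        sendRequest()            // throws once the target channel is closed
      } catch {
        case _: ClosedChannelException =>
          reconnect()            // the "Reconnecting to broker." log line
          Thread.sleep(300)      // matches the ~300 ms cadence in the logs
      }
    }
  }

  def shutdown(): Unit = running.set(false)
}
{code}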

Interspersed, there are floods of log messages like the following, repeated 
seemingly for every partition on the broker (the broker has about 450 
partitions).  It's trying to shrink the ISR down to just the current broker 
(EVEN THOUGH IT JUST COMPLETED CONTROLLED SHUTDOWN for all partitions).  The 
following sequence goes on in bursts of about 500ms, once for each partition on 
the box, and then recurs every 10 seconds:
{code}
2015-04-14 05:56:23,716  INFO [kafka-scheduler-0] cluster.Partition - Partition 
[mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0] from 46,45 
to 45
2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - Partition 
[mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that in 
zookeeper, skip updating ISR
2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - Partition 
[mytopic3,13] on broker 45: Shrinking ISR for partition [mytopic3,13] from 
45,48 to 45
2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - Partition 
[mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that in 
zookeeper, skip updating ISR
2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - Partition 
[mytopic4,12] on broker 45: Shrinking ISR for partition [mytopic4,12] from 
46,45 to 45
2015-04-14 05:56:23,726  INFO [kafka-scheduler-0] cluster.Partition - Partition 
[mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that in 
zookeeper, skip updating ISR
{code}

The above sequence is repeated every 10 seconds, it seems.
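
To spell out why this loops (again a hypothetical sketch with made-up names, 
not the actual ReplicaManager/Partition code; I'm assuming the 10-second period 
comes from the isr-expiration task, presumably tied to replica.lag.time.max.ms, 
which defaults to 10000ms): the shrink is attempted on every tick, the 
conditional update is skipped because the cached zkVersion is stale, so the 
cached state never changes and the exact same shrink is attempted again on the 
next tick.
{code}
// Hypothetical sketch only, not the real ReplicaManager/Partition code.
object IsrShrinkLoopSketch {
  @volatile var cachedZkVersion = 19        // what the broker still has cached
  @volatile var zkVersionInZooKeeper = 20   // ZK moved on; the cache is stale

  def maybeShrinkIsr(partition: String): Unit = {
    println(s"Shrinking ISR for partition [$partition]")
    if (cachedZkVersion != zkVersionInZooKeeper) {
      // The conditional write is skipped, the cache is never refreshed, and the
      // exact same shrink is attempted again on the next scheduler tick.
      println(s"Cached zkVersion [$cachedZkVersion] not equal to that in zookeeper, skip updating ISR")
    }
  }

  def main(args: Array[String]): Unit = {
    while (true) {
      Seq("mytopic2,0", "mytopic3,13", "mytopic4,12").foreach(maybeShrinkIsr)
      Thread.sleep(10000)   // ~10 s period seen in the logs
    }
  }
}
{code}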

So there appear to be 2 different repeating, non-terminating loops:
1. the Controller-45-to-broker-45-send-thread repeating every 300ms
2. the kafka-scheduler thread attempting to shrink the ISR for every partition 
on the box, every 10 seconds

It seems clear that these loops would have continued indefinitely had the app 
not been killed hard.



