Zhanxiang Huang created KAFKA-6846:
--------------------------------------

             Summary: Controller can spend long time in shutting down 
RequestSendThread when processing BrokerChange event
                 Key: KAFKA-6846
                 URL: https://issues.apache.org/jira/browse/KAFKA-6846
             Project: Kafka
          Issue Type: Bug
          Components: controller
            Reporter: Zhanxiang Huang


The controller can spend a long time (more than 60s) processing a BrokerChange 
event when there are dead brokers. For example, we saw entries like these in 
the controller log:

 
{code:java}
2018/04/28 18:13:50.021 [KafkaController] [Controller 7586]: Newly added 
brokers: , deleted brokers: 5222, bounced Brokers: , all live brokers: 
3238,3322,5134,5177,5213,5214,5217,5218,5219,5220,5221,5319,5652,5949,7569,7574,7577,7581,7586,7589,7594,7595,7601,7609,14838,14840,14848,14855,14882,14886,14889,14901,16033
2018/04/28 18:13:50.021 [RequestSendThread] 
[Controller-7586-to-broker-5222-send-thread]: Shutting down
.
.
.
2018/04/28 18:14:49.196 [RequestSendThread] 
[Controller-7586-to-broker-5222-send-thread]: Shutdown completed
2018/04/28 18:14:49.196 [RequestSendThread] 
[Controller-7586-to-broker-5222-send-thread]: Stopped
2018/04/28 18:14:49.200 [KafkaController] [Controller 7586]: Broker failure 
callback for 5222{code}
 

This shows that 59s elapsed between the time the RequestSendThread shutdown was 
initiated (18:13:50) and the time it completed (18:14:49).

The root cause is that RequestSendThread calls NetworkClient.poll() in a 
while loop in NetworkClientUtils.awaitReady() and 
NetworkClientUtils.sendAndReceive() without checking the interrupt flag. As a 
result, the interrupt triggered by the controller thread breaks out of poll() 
only once; the RequestSendThread then blocks in the next poll() until it 
receives the disconnect notification or times out, and only then can it finish 
the shutdown. During this period the controller event thread is blocked waiting 
on the shutdownComplete latch, which is bad because there is only a single 
controller event thread.

This issue can be resolved by making the thread throw InterruptedException 
right after each poll() call in awaitReady() and sendAndReceive() if it sees 
that the interrupt flag has been set. I will create a PR for that.
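
As a rough illustration of the proposed check (not the actual patch), the 
sketch below uses a hypothetical Poller interface in place of the real 
NetworkClient, and a hypothetical awaitReady() helper that only mirrors the 
shape of the loop described above. The fake poll() is assumed to return early 
on interrupt while leaving the interrupt flag set; the loop then throws 
InterruptedException instead of entering another blocking poll.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptAwarePollSketch {

    /** Hypothetical stand-in for the client; NOT the real NetworkClient API. */
    interface Poller {
        boolean isReady();
        void poll(long timeoutMs);
    }

    /** Polls until ready or timed out, checking the interrupt flag after each poll. */
    static boolean awaitReady(Poller client, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!client.isReady()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            client.poll(remaining);
            // Proposed check: surface a pending interrupt instead of polling again.
            if (Thread.interrupted()) {
                throw new InterruptedException("interrupted while waiting for readiness");
            }
        }
        return true;
    }

    /** Interrupts a waiter stuck on a never-ready poller; true if it stops promptly. */
    static boolean interruptStopsWaiter() throws InterruptedException {
        Poller neverReady = new Poller() {
            public boolean isReady() { return false; }
            public void poll(long timeoutMs) {
                try {
                    Thread.sleep(timeoutMs);
                } catch (InterruptedException e) {
                    // Assumption: poll returns early and leaves the flag set.
                    Thread.currentThread().interrupt();
                }
            }
        };
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        Thread waiter = new Thread(() -> {
            try {
                awaitReady(neverReady, 60_000L);
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        });
        waiter.start();
        waiter.interrupt();
        waiter.join(5_000L); // far shorter than the 60s poll timeout
        return sawInterrupt.get() && !waiter.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("waiter stopped on interrupt: " + interruptStopsWaiter());
    }
}
```

Without the Thread.interrupted() check, the loop in this sketch would call 
poll() again after the interrupt, which corresponds to the delayed shutdown 
described above.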

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
