Cameron created KAFKA-15766:
-------------------------------

             Summary: Possible request handler thread exhaustion in controller 
election
                 Key: KAFKA-15766
                 URL: https://issues.apache.org/jira/browse/KAFKA-15766
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.5.0
            Reporter: Cameron


Hello,

This is my first dive into the Kafka source code, so I may be misinterpreting 
some of the logic outlined below. Please let me know if my understanding is not 
correct.

 

--- 

 

After upgrading a relatively large Kafka cluster from IBP 2.5 to 3.5 (Kafka 
version v3.5.0), we've had numerous problems with controller failover and the 
subsequent elections not being properly acknowledged by a subset of the 
cluster's brokers. The affected brokers temporarily become unresponsive due to 
complete request handler thread starvation caused by failed 
ALLOCATE_PRODUCER_IDS and ALTER_PARTITION requests to the old, now-stopped 
controller. In the case of a disconnected response, as seen here, the 
broker-to-controller channel places these failed requests back at the beginning 
of the queue, effectively deadlocking the request handler thread pool. 
Depending on how quickly the configured queued.max.requests request queue 
fills, the new controller's UPDATE_METADATA request announcing the election 
result may not even reach the affected broker, so the broker never learns the 
new controller id.

For context, the relevant configuration of my cluster is:
 * num.network.threads = 16
 * num.io.threads = 8
 * queued.max.requests = 500

 

I am not able to consistently reproduce the issue, but it happens frequently 
enough to cause major pain in our setup. The general order of events is:
 # A controller election takes place (due to intentional failover or otherwise).
 # The new controller successfully registers its znode with ZooKeeper and 
initiates an UPDATE_METADATA request to notify the other brokers in the cluster.
 # The problematic broker, noticing it has lost its connection to the old 
controller but not yet having received and/or processed the incoming 
UPDATE_METADATA request, spams ALLOCATE_PRODUCER_IDS and ALTER_PARTITION 
requests to the old controller (roughly every 50ms, per the hardcoded retry 
backoff), receiving disconnect responses and placing each failed request back 
at the front of the request queue (see the sketch after this list).
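
As a rough illustration of step 3, here is a toy Java loop (again not Kafka's 
implementation; the 50ms value is just the hardcoded backoff referenced above) 
showing how a broker that still has the old controller id cached keeps retrying 
against it until UPDATE_METADATA is processed:

{code:java}
// Toy sketch of the retry pattern in step 3, NOT Kafka's implementation.
// The broker keeps sending to the controller id it has cached; every attempt
// ends in a disconnect and is retried after the ~50ms backoff, until the
// broker processes UPDATE_METADATA and learns the new controller id.
public class ControllerRetrySketch {
    private static final long RETRY_BACKOFF_MS = 50;

    public static void main(String[] args) throws InterruptedException {
        String cachedController = "old controller (stopped)";
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.printf("attempt %d: ALTER_PARTITION -> %s: disconnected%n",
                    attempt, cachedController);
            Thread.sleep(RETRY_BACKOFF_MS); // retry loop continues until metadata is updated
        }
    }
}
{code}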

 

In this scenario, the needed UPDATE_METADATA request takes much longer to 
process. In our specific cluster this sometimes auto-resolves in 3-5 minutes, 
whereas other times it requires manual intervention to bounce the broker and 
let it initialize a connection with the new controller.
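
For what it's worth, the observable symptom during this window is the handler 
pool going completely busy. Below is a hedged sketch of how one might watch for 
that over JMX, assuming the standard 
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent 
meter; the host name and port are placeholders, not values from this incident.

{code:java}
// Hedged sketch only: poll the request-handler idle meter over JMX to spot the
// starvation window. Host, port, and endpoint are placeholder assumptions.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HandlerIdleCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi"); // placeholder endpoint
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName idleMeter = new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent");
            // Values near 0 mean the request handler (io) threads are fully busy,
            // i.e. the starvation described above; values near 1 mean mostly idle.
            Object oneMinuteRate = conn.getAttribute(idleMeter, "OneMinuteRate");
            System.out.println("RequestHandlerAvgIdlePercent (1m rate): " + oneMinuteRate);
        } finally {
            connector.close();
        }
    }
}
{code}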

 

This issue seems very similar to some previous tickets I found:
 * https://issues.apache.org/jira/browse/KAFKA-7697 
 * https://issues.apache.org/jira/browse/KAFKA-14694 

Unfortunately I have not been able to grab a thread dump because of the 
sporadic nature of the issue, but I will update the ticket if I can capture one.

This may also affect other versions of Kafka, but we've only experienced it 
with v3.5.0 + IBP 3.5. Curiously enough, we haven't run into this issue with 
v3.5.0 + IBP 2.5, presumably because the ALTER_PARTITION and 
ALLOCATE_PRODUCER_IDS features are not enabled until IBP 2.7 and 3.0, 
respectively.
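
To illustrate why the IBP seems to matter here, a toy sketch of the version 
gating as I understand it (not Kafka's code; the thresholds are the ones 
mentioned above, from KIP-497 and KIP-730):

{code:java}
// Illustrative-only sketch of the IBP gating described above, NOT Kafka's code.
// With IBP 2.5 neither controller-bound request is used, which would explain
// why we only see the problem after bumping the IBP.
public class IbpGateSketch {
    enum Ibp {
        IBP_2_5, IBP_2_7, IBP_3_0, IBP_3_5;
        boolean isAtLeast(Ibp other) { return ordinal() >= other.ordinal(); }
    }

    public static void main(String[] args) {
        Ibp ibp = Ibp.IBP_3_5; // our cluster after the upgrade
        boolean sendsAlterPartition = ibp.isAtLeast(Ibp.IBP_2_7);      // AlterIsr/AlterPartition, KIP-497
        boolean sendsAllocateProducerIds = ibp.isAtLeast(Ibp.IBP_3_0); // AllocateProducerIds, KIP-730
        System.out.println("ALTER_PARTITION enabled:       " + sendsAlterPartition);
        System.out.println("ALLOCATE_PRODUCER_IDS enabled: " + sendsAllocateProducerIds);
    }
}
{code}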

 

Is there any guidance on whether this failure path is possible as described, 
and what steps can be taken to mitigate the issue?

 

Thank you!

 


