mlowicki created KAFKA-12836: -------------------------------- Summary: Possible race in handling IP address change Key: KAFKA-12836 URL: https://issues.apache.org/jira/browse/KAFKA-12836 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 2.4.1 Reporter: mlowicki
We've experienced few incidents where suddenly Kafka producers weren't able to send messages and were simply timing out. I've a repro now after some investigation. We're running the cluster of 3 brokers on K8s and it's enough to kill the underlying AWS VM using AWS Console which causes also IP change (in our case 10.128.132.196 → 10.128.135.90). It needs to be VM hosting broker which is a leader of partition (node 10000). Slightly before the kill there were 2 events: {quote}About to close the idle connection from 10001 due to being idle for 38545 millis About to close the idle connection from 10002 due to being idle for 34946 millis{quote} And then client picks the node to connect to: {quote}Removing node [10.128.140.202:9092|http://10.128.140.202:9092/] (id: 10002 rack: us-east-1c) from least loaded node selection since it is neither ready for sending or connecting Found least loaded connecting node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) About to close the idle connection from 10000 due to being idle for 30529 millis Node 10000 disconnected. Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) using address /[10.128.132.196|http://10.128.132.196/]{quote} and then it repeats: {quote}Found least loaded connecting node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) About to close the idle connection from 10000 due to being idle for 30027 millis Node 10000 disconnected. Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) using address /[10.128.132.196|http://10.128.132.196/]{quote} {quote}Found least loaded connecting node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) About to close the idle connection from 10000 due to being idle for 30027 millis Node 10000 disconnected. Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) using address /[10.128.132.196|http://10.128.132.196/]{quote} {quote}Found least loaded connecting node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) About to close the idle connection from 10000 due to being idle for 30027 millis Node 10000 disconnected. Initiating connection to node [10.128.132.196:9092|http://10.128.132.196:9092/] (id: 10000 rack: us-east-1a) using address /[10.128.132.196|http://10.128.132.196/]{quote} Affected clients won't ever request METADATA from other brokers (10001 or 10002) to discover new IP. It doesn't happen for all clients as some handle the process gracefully and those log with: {quote}[log.info|http://log.info/]("Hostname for node {} changed from {} to {}.", id, connectionState.host(), host);{quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)