[ 
https://issues.apache.org/jira/browse/KAFKA-13901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Augarten updated KAFKA-13901:
---------------------------------
    Description: 
I'm currently running into this problem running org.apache.kafka:kafka and 
org.apache.kafka:kafka-clients version 2.5.1.

The symptoms of this problem are very similar to 
https://issues.apache.org/jira/browse/KAFKA-8803

I'm currently running a test that runs one embedded Kafka broker and an 
embedded Flink cluster (version 1.13). The Flink application uses an 
exactly-once Kafka producer. When initializing, the FlinkKafkaProducer calls 
[initTransactions|https://github.com/apache/flink/blob/be969dd73b533b03acaba1d81d03b29fccc54bfa/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer.java#L1282-L1285]
 for each of ~50 underlying KafkaProducers. On some executions of the test (but 
not all of them), one of these calls times out awaiting InitProducerId. Based 
on my reading of the logs and some debugging sessions, it seems that the 
failing producer continually tries to issue FindCoordinatorRequests, but 
[Sender.awaitNodeReady|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L511-L526]
 returns null. It seems that, every time, 
[NetworkClient.leastLoadedNode|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L662]
 returns null and logs that the one broker "is neither ready for sending or 
connecting" ([source|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L692]).

I set the producer max.block.ms to 5 minutes, and you can see in the logs that 
this goes on for the full 5 minutes. During this time, the broker 
appears healthy (and even serves other requests). It seems that the client is 
not even attempting to reconnect to the broker during these 5 minutes, though 
from looking through the code I honestly don't understand what could be going 
wrong. Do you have any ideas? Is there any more information I could provide? 


  was:
I'm currently running into this problem running org.apache.kafka:kafka and 
org.apache.kafka:kafka-client version 2.5.1.  

 

The symptoms of this problem are very similar to 
https://issues.apache.org/jira/browse/KAFKA-8803

 

I'm currently running a test that runs one embedded Kafka broker and an 
embedded Flink cluster (version 1.13). The Flink application uses an exactly 
once Kafka producer. When initializing, the FlinkKafkaProducer calls 
[initTransactions|https://github.com/apache/flink/blob/be969dd73b533b03acaba1d81d03b29fccc54bfa/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer.java#L1282-L1285]
 for each of ~50 underlying KafkaProducers. On some executions of the test (but 
not all of them), one of these calls to InitProducerId times out. Based on my 
reading of the logs and some debugging sessions, it seems that the failing 
producer continually tries to issue FindCoordinatorRequests, but 
[Sender.awaitNodeReady|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L511-L526]
 returns null. It seems like every time 
[NetworkClient.leastLoadedNode|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L662]
 returns null prints that the one broker "is neither ready for sending or 
connecting"([source|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L692]).

I set the producer max.block.ms to be 5mins and you can see in the logs that 
this goes on and on for the full 5 minutes. During this time, the broker 
appears healthy (and even serves other requests). It seems that the client is 
not even attempting to reconnect to the broker during this 5 minutes, though I 
truthfully don't understand what could be going wrong looking through the code. 
Do you have any ideas? Any more information I could provide? 



> Exactly once producer cannot start due to TimeoutException: Timeout expired 
> after 300000ms awaiting InitProducerId
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13901
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13901
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.5.1
>            Reporter: Ben Augarten
>            Priority: Major
>         Attachments: broker-logs-renamed-topics, client-logs-truncated
>
>
> I'm currently running into this problem running org.apache.kafka:kafka and 
> org.apache.kafka:kafka-clients version 2.5.1.  
>  
> The symptoms of this problem are very similar to 
> https://issues.apache.org/jira/browse/KAFKA-8803
>  
> I'm currently running a test that runs one embedded Kafka broker and an 
> embedded Flink cluster (version 1.13). The Flink application uses an 
> exactly-once Kafka producer. When initializing, the FlinkKafkaProducer calls 
> [initTransactions|https://github.com/apache/flink/blob/be969dd73b533b03acaba1d81d03b29fccc54bfa/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer.java#L1282-L1285]
>  for each of ~50 underlying KafkaProducers. On some executions of the test 
> (but not all of them), one of these calls times out awaiting InitProducerId. 
> Based on my reading of the logs and some debugging sessions, it seems that the 
> failing producer continually tries to issue FindCoordinatorRequests, but 
> [Sender.awaitNodeReady|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L511-L526]
>  returns null. It seems that, every time, 
> [NetworkClient.leastLoadedNode|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L662]
>  returns null and logs that the one broker "is neither ready for sending or 
> connecting" ([source|https://github.com/apache/kafka/blob/2.5/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L692]).
> I set the producer max.block.ms to 5 minutes, and you can see in the logs 
> that this goes on for the full 5 minutes. During this time, the broker 
> appears healthy (and even serves other requests). It seems that the client is 
> not even attempting to reconnect to the broker during these 5 minutes, though 
> from looking through the code I honestly don't understand what could be going 
> wrong. Do you have any ideas? Is there any more information I could provide? 
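
For anyone trying to reproduce this, the producer setup described above can be approximated with the sketch below. It is not the actual test or Flink code: the class name, the `test-txn-` transactional.id prefix, the `localhost:9092` address, and the byte-array serializers are all assumptions made for illustration. Only the count of roughly 50 producers and the 5-minute max.block.ms come from the report. The sketch builds the configurations with java.util.Properties only, so it runs without a broker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class TransactionalProducerConfigSketch {

    // Builds the config for one transactional producer.
    // The transactional.id prefix and serializer choice are hypothetical.
    static Properties producerProps(String bootstrapServers, int producerIndex) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Each transactional producer needs its own transactional.id.
        props.setProperty("transactional.id", "test-txn-" + producerIndex);
        props.setProperty("enable.idempotence", "true");
        // 5 minutes, matching the 300000ms in the reported TimeoutException.
        props.setProperty("max.block.ms", "300000");
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        return props;
    }

    public static void main(String[] args) {
        // Assumed address of the single embedded broker.
        String bootstrapServers = "localhost:9092";
        List<Properties> configs = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            configs.add(producerProps(bootstrapServers, i));
        }
        System.out.println("Built " + configs.size() + " producer configs");
    }
}
```

With kafka-clients on the classpath, each Properties object would be passed to a KafkaProducer constructor and followed by a call to initTransactions(), which blocks for at most max.block.ms before failing with the TimeoutException in the issue title.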



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
