[ https://issues.apache.org/jira/browse/KAFKA-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860011#comment-15860011 ]
Vipul Singh edited comment on KAFKA-4739 at 2/9/17 7:21 PM:
------------------------------------------------------------

Hey [~hachikuji]. We tried to reproduce the issue again. Please note:

1. In our broker config we use a *request.timeout.ms* of *300001* and a *group.max.session.timeout.ms* of *300000*.
2. In our client config, the only settings we have changed from the defaults are in this gist: https://gist.github.com/neoeahit/c1d4027b975b95267e3cbe506899aef8
3. We tried to grep for our consumer group during the time the disconnection was happening (the broker logs are at https://gist.github.com/neoeahit/622515ba391ddf8566bf09af880a6ae0). Between 17:55:50,614 and 17:55:51,027 we were not able to find any requests.
4. The client-side logs around this time are here: https://gist.github.com/neoeahit/3a0a5027bc3499b85cb888918faac2a3 (Please note we have two brokers, and their IPs are 1.1.1.1 and 1.1.1.6; the actual IPs have been changed so the logs could be made publicly available.)

[~huxi_2b] I am puzzled by that 40 seconds myself. We don't set it anywhere in the config, yet we are seeing it in the logs. One other thing that is a bit puzzling is that the max_wait_time=500 in the client fetch requests doesn't seem to be honored. Maybe that is because the connection is already disconnected?

Please help us with any pointers or troubleshooting steps we can use to figure out this issue. It is causing us a lot of pain, with consumers randomly being blocked.

> KafkaConsumer poll going into an infinite loop
> ----------------------------------------------
>
>                 Key: KAFKA-4739
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4739
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.9.0.1
>            Reporter: Vipul Singh
>
> We are seeing an issue with our Kafka consumer where it seems to go into an
> infinite loop while polling, trying to fetch data from Kafka. We are seeing
> the heartbeat requests from the consumer on the broker, but nothing else from
> the consumer.
> We enabled debug-level logging on the consumer and see these logs:
> https://gist.github.com/neoeahit/757bff7acdea62656f065f4dcb8974b4
> And this just goes on. The way we have been able to replicate this issue is
> by restarting the process in multiple successions.
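For readers following along, below is a minimal sketch of the kind of 0.9.0.1 poll loop the report describes. It is illustrative only: the bootstrap servers, port, group id, and topic name are placeholders, and the reporter's actual non-default client settings are in the gist linked in the comment, not reproduced here. With an otherwise-default client configuration, fetch.max.wait.ms is 500 ms, which corresponds to the max_wait_time=500 seen in the client fetch requests.

{code:java}
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollLoopSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder values; the reporter's real non-default settings are in the linked gist.
        // The two broker IPs match the anonymized addresses in the logs; the port is assumed.
        props.put("bootstrap.servers", "1.1.1.1:9092,1.1.1.6:9092");
        props.put("group.id", "example-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // With defaults, fetch.max.wait.ms is 500 ms, i.e. the max_wait_time=500
        // visible in the client fetch requests mentioned in the comment.

        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("example-topic"));
        while (true) {
            // The report describes poll() spinning without returning records,
            // while heartbeat requests still reach the broker.
            ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
            for (ConsumerRecord<byte[], byte[]> record : records) {
                System.out.printf("offset=%d%n", record.offset());
            }
        }
    }
}
{code}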