[jira] [Updated] (KAFKA-6565) Sudden unresponsiveness from broker + file descriptor leak

Noam Berman (JIRA) Thu, 15 Feb 2018 12:40:36 -0800

     [ 
https://issues.apache.org/jira/browse/KAFKA-6565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Noam Berman updated KAFKA-6565:
-------------------------------
    Description: 
Running 3 kafka clusters v0.10.1.0  on aws, 5 brokers each (+5 zookeepers), in 
the past week about 2-3 times a day (on all clusters, no specific brokers) we 
encounter a situation in which a single broker stops responding to all requests 
from clients/other brokers. at this moment all produced messages fail, and 
other brokers start writing this to log repeatedly:

[2018-02-15 17:56:08,949] WARN [ReplicaFetcherThread-0-878], Error in fetch 
kafka.server.ReplicaFetcherThread$FetchRequest@772ad40c 
(kafka.server.ReplicaFetcherThread)
 java.io.IOException: Connection to 878 was disconnected before the response 
was read
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
 at scala.Option.foreach(Option.scala:257)
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
 at 
kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
 at 
kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
 at 
kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
 at 
kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
 at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
 at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
 at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
 at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

 

producers (examples):

 2018-02-13 21:30:51:010 thread=[kafka-producer-network-thread | 
catalog-index-0] WARN  Got error produce response with correlation id 8147 on 
topic-partition catalog-1, retrying (49 attempts left). Error: REQUEST_TIMED_OUT

2018-02-13 21:31:06:221 thread=[kafka-producer-network-thread | 
catalog-index-0] WARN  Got error produce response with correlation id 8166 on 
topic-partition catalog-index-18, retrying (48 attempts left). Error: 
NOT_LEADER_FOR_PARTITION

 

Once this happens, file descriptors on the affected broker start increasing at 
a high rate until it reaches its maximum.

no relevant info logs on the affected broker, we'll try to gather debug logs 
and attach them to the bug next time it happens.

the big issue here is that although the broker is unresponsive, it stays in the 
cluster and its zookeeper node isn't cleared, so there's no rebalance 
whatsoever - all clients start failing until a manual shutdown occurs.

 

  was:
Running 3 kafka clusters v0.10.1.0  on aws, 5 brokers each (+5 zookeepers), in 
the past week about 2-3 times a day (on all clusters, no specific brokers) we 
encounter a situation in which a single broker stops responding to all requests 
from clients/other brokers. at this moment all produced messages fail, and 
other brokers start writing this to log repeatedly:

[2018-02-15 17:56:08,949] WARN [ReplicaFetcherThread-0-878], Error in fetch 
kafka.server.ReplicaFetcherThread$FetchRequest@772ad40c 
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 878 was disconnected before the response was 
read
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
 at scala.Option.foreach(Option.scala:257)
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
 at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
 at 
kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
 at 
kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
 at 
kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
 at 
kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
 at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
 at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
 at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
 at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

 

Once this happens, file descriptors on the affected broker start increasing at 
a high rate until it reaches its maximum.

no relevant info logs on the affected broker, we'll try to gather debug logs 
and attach them to the bug next time it happens.

the big issue here is that although the broker is unresponsive, it stays in the 
cluster and its zookeeper node isn't cleared, so there's no rebalance 
whatsoever - all clients start failing until a manual shutdown occurs.

 


> Sudden unresponsiveness from broker + file descriptor leak
> ----------------------------------------------------------
>
>                 Key: KAFKA-6565
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6565
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: 3 separate clusters running 0.10.1.0 with 5 nodes  + 5 
> zookeepers.
> clients are mostly java 0.10.2.1, and a few nodejs clients 
> (https://github.com/oleksiyk/kafka). 
> throughput per broker: ~3.5MB per second. each broker is a leader of ~2500 
> partitions. 
>            Reporter: Noam Berman
>            Priority: Blocker
>
> Running 3 kafka clusters v0.10.1.0  on aws, 5 brokers each (+5 zookeepers), 
> in the past week about 2-3 times a day (on all clusters, no specific brokers) 
> we encounter a situation in which a single broker stops responding to all 
> requests from clients/other brokers. at this moment all produced messages 
> fail, and other brokers start writing this to log repeatedly:
> [2018-02-15 17:56:08,949] WARN [ReplicaFetcherThread-0-878], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@772ad40c 
> (kafka.server.ReplicaFetcherThread)
>  java.io.IOException: Connection to 878 was disconnected before the response 
> was read
>  at 
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>  at 
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>  at 
> kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>  at 
> kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>  at 
> kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>  at 
> kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>  at 
> kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
>  at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>  at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>  at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>  
> producers (examples):
>  2018-02-13 21:30:51:010 thread=[kafka-producer-network-thread | 
> catalog-index-0] WARN  Got error produce response with correlation id 8147 on 
> topic-partition catalog-1, retrying (49 attempts left). Error: 
> REQUEST_TIMED_OUT
> 2018-02-13 21:31:06:221 thread=[kafka-producer-network-thread | 
> catalog-index-0] WARN  Got error produce response with correlation id 8166 on 
> topic-partition catalog-index-18, retrying (48 attempts left). Error: 
> NOT_LEADER_FOR_PARTITION
>  
> Once this happens, file descriptors on the affected broker start increasing 
> at a high rate until it reaches its maximum.
> no relevant info logs on the affected broker, we'll try to gather debug logs 
> and attach them to the bug next time it happens.
> the big issue here is that although the broker is unresponsive, it stays in 
> the cluster and its zookeeper node isn't cleared, so there's no rebalance 
> whatsoever - all clients start failing until a manual shutdown occurs.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (KAFKA-6565) Sudden unresponsiveness from broker + file descriptor leak

Reply via email to