Hi there,

This sounds similar to https://issues.apache.org/jira/browse/KAFKA-4477. Have you tried 0.10.1.1?
-Jason

On Fri, Jan 20, 2017 at 5:27 PM, Hui Yang <huiy...@expedia.com> wrote:

> Hi, Kafka Team
>
> This is Hui Yang from the Expedia engineering team, and I want to ask a
> question about a Kafka 0.10 issue.
> Our team uses Kafka as core infrastructure and recently upgraded from
> Kafka 0.8.2.2 to Kafka 0.10.1.0, but we hit an issue after the upgrade.
>
> The issue is as follows:
> Kafka 0.10 worked well for a couple of days after the upgrade, but then we
> started to see "java.io.IOException: Connection to 3 was disconnected
> before the response was read" on each Kafka broker when trying to
> communicate with the controller (as you may know, one of the Kafka brokers
> acts as a controller to handle topic/partition assignment and state-change
> tasks; in our case, it is broker 3).
> Even in the controller log, I found "[Controller-3-to-broker-3-send-thread],
> Controller 3 epoch 3 fails to send request, java.io.IOException: Connection
> to 3 was disconnected before the response was read", so it looks like the
> controller is not even able to send messages to itself.
> After we saw those exceptions on the brokers for a while, we started to see
> timeout exceptions on the producer side: our producer was no longer able to
> send messages to the brokers.
>
> When I checked the JMX metrics, I found the CPU usage on the controller is
> consistently higher than on the other brokers after the upgrade to Kafka
> 0.10 (the brokers had similar CPU usage on Kafka 0.8), and the controller's
> memory usage spiked during the issue. I assume the controller may not have
> had enough memory left to create new connections for the producer and the
> other brokers.
>
> One more thing to mention: we use the Kafka 0.8 protocol and message
> format on the Kafka 0.10 brokers so that we can still use 0.8 clients.
>
> Details of the exceptions:
>
> "WARN [ReplicaFetcherThread-0-3], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@87d8e00 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 3 was disconnected before the response was read
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>     at scala.Option.foreach(Option.scala:257)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>     at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>     at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>     at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>     at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"
>
> "WARN [Controller-3-to-broker-3-send-thread], Controller 3 epoch 1 fails to send request
> java.io.IOException: Connection to 2 was disconnected before the response was read
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>     at scala.Option.foreach(Option.scala:257)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>     at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>     at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>     at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>     at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>     at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:190)
>     at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:181)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"
>
> In production, we run 6 Kafka brokers with 3 ZooKeeper nodes on AWS, using
> the c3.xlarge instance type.
> Our JVM settings are as follows: -Xmx1G -Xms1G -server
> -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark.
> Our traffic is 500 TPS, and each message is about 100KB on average.
>
> I appreciate your time and any help or suggestions you can give us about
> this issue!
>
> Best,
>
> Hui
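For reference, the 0.8-compatibility setup Hui describes (0.8 protocol and message format on 0.10 brokers) is normally expressed through two broker settings in server.properties. This is a minimal sketch, not Hui's actual configuration; the exact version strings are assumptions and would need to match the original 0.8.x release:

```properties
# Keep inter-broker communication on the old wire protocol until the
# whole cluster is upgraded (assumed value; match your prior 0.8.x version).
inter.broker.protocol.version=0.8.2

# Keep messages on disk in the old format so 0.8 clients can still
# produce and consume without per-fetch down-conversion overhead.
log.message.format.version=0.8.2
```

Once all brokers and clients are upgraded, both settings are typically bumped to the new version (one rolling restart per setting, protocol version first).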