Guys, another observation: 90% of the under-replicated partitions have the same node as the follower.
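For reference, the under-replicated set can be listed with the stock topics tool; a minimal sketch, assuming the standard CLI shipped with the Kafka distribution and a placeholder ZooKeeper client port (2181):

  # Lists only partitions whose ISR is smaller than the replica set.
  bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

Each output line shows Leader, Replicas, and Isr for one partition, so the broker id that keeps appearing in Replicas but missing from Isr is the recurring follower.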
*Any help here is very much appreciated. We have very little time to stabilize Kafka. Thanks a lot in advance.*

-Suman

On Thu, Dec 6, 2018 at 9:08 PM Suman B N <sumannew...@gmail.com> wrote:

> +users
>
> On Thu, Dec 6, 2018 at 9:01 PM Suman B N <sumannew...@gmail.com> wrote:
>
>> Team,
>>
>> We are observing ISR shrink and expand very frequently. In the logs of
>> the follower, the errors below are observed:
>>
>> [2018-12-06 20:00:42,709] WARN [ReplicaFetcherThread-2-15], Error in
>> fetch kafka.server.ReplicaFetcherThread$FetchRequest@a0f9ba9
>> (kafka.server.ReplicaFetcherThread)
>> java.io.IOException: Connection to 15 was disconnected before the
>> response was read
>>   at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
>>   at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
>>   at scala.Option.foreach(Option.scala:257)
>>   at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
>>   at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
>>   at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
>>   at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>>   at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
>>   at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
>>   at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>>   at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>>   at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>>   at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>>
>> Can someone explain this? And help us understand how we can resolve
>> these under-replicated partitions.
>>
>> server.properties file:
>>
>> broker.id=15
>> port=9092
>> zookeeper.connect=zk1,zk2,zk3,zk4,zk5,zk6
>>
>> default.replication.factor=2
>> log.dirs=/data/kafka
>> delete.topic.enable=true
>> zookeeper.session.timeout.ms=10000
>> inter.broker.protocol.version=0.10.2
>> num.partitions=3
>> min.insync.replicas=1
>> log.retention.ms=259200000
>> message.max.bytes=20971520
>> replica.fetch.max.bytes=20971520
>> replica.fetch.response.max.bytes=20971520
>> max.partition.fetch.bytes=20971520
>> fetch.max.bytes=20971520
>> log.flush.interval.ms=5000
>> log.roll.hours=24
>> num.replica.fetchers=3
>> num.io.threads=8
>> num.network.threads=6
>> log.message.format.version=0.9.0.1
>>
>> Also, in what cases do we end up in this state? We have 1200-1400
>> topics and 5000-6000 partitions spread across a 20-node cluster, but
>> only 30-40 partitions are under-replicated while the rest are in sync.
>> 95% of these partitions have a replication factor of 2.
>>
>> --
>> *Suman*
>
> --
> *Suman*
> *OlaCabs*

--
*Suman*
*OlaCabs*
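As a footnote on the configuration above: in 0.10.x a follower is dropped from the ISR when it has not fetched from, or caught up to, the leader's log end within replica.lag.time.max.ms (10000 ms by default), so repeated fetch disconnections like the one in the stack trace translate directly into shrink/expand cycles. Below is a minimal sketch of the knobs usually inspected for this symptom; the values are illustrative assumptions, not a verified fix:

  # How long a follower may lag before being dropped from the ISR
  # (broker default in 0.10.x is 10000 ms; raising it reduces spurious
  # shrinks at the cost of slower detection of truly dead replicas).
  replica.lag.time.max.ms=30000

  # Socket timeout for replica fetch requests (default 30000 ms); per the
  # broker docs it should be at least replica.fetch.wait.max.ms.
  replica.socket.timeout.ms=30000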