Still hoping for some help here.

On Fri, Dec 7, 2018 at 12:24 AM Suman B N <sumannew...@gmail.com> wrote:
> Guys,
> Another observation: 90% of the under-replicated partitions have the
> same node as the follower.
>
> *Any help here is very much appreciated. We have very little time to
> stabilize Kafka. Thanks a lot in advance.*
>
> -Suman
>
> On Thu, Dec 6, 2018 at 9:08 PM Suman B N <sumannew...@gmail.com> wrote:
>
>> +users
>>
>> On Thu, Dec 6, 2018 at 9:01 PM Suman B N <sumannew...@gmail.com> wrote:
>>
>>> Team,
>>>
>>> We are observing ISR shrink and expand very frequently. In the
>>> follower's logs, the following errors are observed:
>>>
>>> [2018-12-06 20:00:42,709] WARN [ReplicaFetcherThread-2-15], Error in
>>> fetch kafka.server.ReplicaFetcherThread$FetchRequest@a0f9ba9
>>> (kafka.server.ReplicaFetcherThread)
>>> java.io.IOException: Connection to 15 was disconnected before the
>>> response was read
>>>     at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
>>>     at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
>>>     at scala.Option.foreach(Option.scala:257)
>>>     at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
>>>     at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
>>>     at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
>>>     at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>>>     at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
>>>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
>>>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>>>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>>>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>>>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>>>
>>> Can someone explain this, and help us understand how we can resolve
>>> these under-replicated partitions?
>>>
>>> server.properties file:
>>> broker.id=15
>>> port=9092
>>> zookeeper.connect=zk1,zk2,zk3,zk4,zk5,zk6
>>>
>>> default.replication.factor=2
>>> log.dirs=/data/kafka
>>> delete.topic.enable=true
>>> zookeeper.session.timeout.ms=10000
>>> inter.broker.protocol.version=0.10.2
>>> num.partitions=3
>>> min.insync.replicas=1
>>> log.retention.ms=259200000
>>> message.max.bytes=20971520
>>> replica.fetch.max.bytes=20971520
>>> replica.fetch.response.max.bytes=20971520
>>> max.partition.fetch.bytes=20971520
>>> fetch.max.bytes=20971520
>>> log.flush.interval.ms=5000
>>> log.roll.hours=24
>>> num.replica.fetchers=3
>>> num.io.threads=8
>>> num.network.threads=6
>>> log.message.format.version=0.9.0.1
>>>
>>> Also, in what cases do we end up in this state? We have 1200-1400
>>> topics and 5000-6000 partitions spread across a 20-node cluster, but
>>> only 30-40 partitions are under-replicated while the rest are in sync.
>>> 95% of these partitions have a replication factor of 2.
>>>
>>> --
>>> *Suman*
>>
>> --
>> *Suman*
>> *OlaCabs*
>
> --
> *Suman*
> *OlaCabs*

--
*Suman*
*OlaCabs*
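[Editor's note] For anyone hitting the same symptom, a quick first step is to confirm which broker keeps falling out of the ISR. A minimal sketch, assuming the stock Kafka CLI tools on a broker host and that zk1 serves ZooKeeper clients on the default port 2181 (host and port here are placeholders, not taken from the thread):

# List only partitions whose ISR is currently smaller than the replica
# set; if one broker id dominates the output, inspect that follower.
bin/kafka-topics.sh --zookeeper zk1:2181 \
    --describe --under-replicated-partitions

If the disconnects turn out to be transient (network blips, long GC pauses on the follower), one commonly discussed mitigation is giving followers more slack before the leader drops them from the ISR. The keys below are standard broker settings in 0.10.x, but the values are illustrative assumptions, not tested recommendations:

# server.properties (illustrative values only)
replica.lag.time.max.ms=30000     # default 10000; max follower lag before ISR eviction
replica.socket.timeout.ms=60000   # default 30000; socket timeout for follower fetch requests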