Team,

We are seeing the ISR shrink and expand very frequently. The follower's logs
show the error below:

[2018-12-06 20:00:42,709] WARN [ReplicaFetcherThread-2-15], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@a0f9ba9 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 15 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
        at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
        at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

Can someone explain this error and help us understand how we can resolve
these under-replicated partitions?
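
In case it helps, this is roughly how we are enumerating the affected
partitions. It is only a minimal sketch and assumes the Java AdminClient from
a recent kafka-clients jar is available on a client host; the class name and
bootstrap address are placeholders:

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class UnderReplicatedReport {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; any broker in the cluster will do.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker15:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch all topic names, then their partition metadata.
            Set<String> topics = admin.listTopics().names().get();
            for (TopicDescription td : admin.describeTopics(topics).all().get().values()) {
                td.partitions().forEach(p -> {
                    // Under-replicated = ISR smaller than the assigned replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d leader=%s isr=%s replicas=%s%n",
                                td.name(), p.partition(),
                                p.leader() == null ? "none" : String.valueOf(p.leader().id()),
                                p.isr(), p.replicas());
                    }
                });
            }
        }
    }
}

(This is essentially the same information kafka-topics.sh --describe
--under-replicated-partitions prints, just easier for us to wire into
monitoring.)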

server.properties file:
broker.id=15
port=9092
zookeeper.connect=zk1,zk2,zk3,zk4,zk5,zk6

default.replication.factor=2
log.dirs=/data/kafka
delete.topic.enable=true
zookeeper.session.timeout.ms=10000
inter.broker.protocol.version=0.10.2
num.partitions=3
min.insync.replicas=1
log.retention.ms=259200000
message.max.bytes=20971520
replica.fetch.max.bytes=20971520
replica.fetch.response.max.bytes=20971520
max.partition.fetch.bytes=20971520
fetch.max.bytes=20971520
log.flush.interval.ms=5000
log.roll.hours=24
num.replica.fetchers=3
num.io.threads=8
num.network.threads=6
log.message.format.version=0.9.0.1

Also, in what cases do we end up in this state? We have 1200-1400 topics and
5000-6000 partitions spread across a 20-node cluster, but only 30-40
partitions are under-replicated while the rest are in sync. About 95% of
these partitions have a replication factor of 2.
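
For completeness, these are the broker-side ReplicaManager metrics we have
been sampling to correlate the ISR churn with specific brokers. This is a
rough JMX sketch under the assumption that JMX is enabled on the broker (e.g.
via JMX_PORT); the host, port, and class name are placeholders:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IsrChurnCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint for broker 15.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker15:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Gauge: number of partitions this broker leads that are under-replicated.
            Object underReplicated = mbs.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                    "Value");
            // Meters: rate of ISR shrinks/expands on this broker.
            Object shrinkRate = mbs.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=IsrShrinksPerSec"),
                    "OneMinuteRate");
            Object expandRate = mbs.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=IsrExpandsPerSec"),
                    "OneMinuteRate");
            System.out.println("UnderReplicatedPartitions=" + underReplicated
                    + " IsrShrinksPerSec(1m)=" + shrinkRate
                    + " IsrExpandsPerSec(1m)=" + expandRate);
        }
    }
}

IsrShrinksPerSec and IsrExpandsPerSec moving together is what we mean above
by the ISR shrinking and expanding frequently.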

-- 
*Suman*
