[ https://issues.apache.org/jira/browse/KAFKA-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759713#comment-16759713 ]
Kemal ERDEN commented on KAFKA-7888:
------------------------------------

Thanks [~junrao], we'll do that once it's released. In the meantime we've increased the ZooKeeper connection and session timeouts from the default of 6 seconds to 30 seconds. We're also thinking of splitting the ZooKeepers and the brokers onto separate boxes (3 servers to 6). Do you think these changes will help solve the problem? Do you have any other recommendations to rectify, or reduce the probability of, this problem? Thanks.

> kafka cluster not recovering - Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition) continuously
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7888
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7888
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, replication, zkclient
>    Affects Versions: 2.1.0
>         Environment: using kafka_2.12-2.1.0
> 3 ZKs, 3-broker cluster, using 3 boxes (1 ZK and 1 broker on each box),
> default.replication.factor: 2,
> offset replication factor was 1 when the error happened; increased to 2 after seeing this error by reassigning partitions.
> compression: default (producer) on the broker, but sending gzip from producers.
> Linux (Red Hat), ext4, Kafka logs on a single local disk
>            Reporter: Kemal ERDEN
>            Priority: Major
>         Attachments: combined.log, producer.log
>
> We're seeing the following repeating logs on our Kafka cluster from time to time, which seems to cause messages expiring on producers and the cluster going into a non-recoverable state. The only fix seems to be to restart the brokers.
> {{Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition)}}
> {{Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)}}
>
> and later on the following log is repeated:
>
> {{Got user-level KeeperException when processing sessionid:0xe046aa4f8e60000 type:setData cxid:0x2df zxid:0xa000001fd txntype:-1 reqpath:n/a Error Path:/brokers/topics/ucTrade/partitions/6/state Error:KeeperErrorCode = BadVersion for /brokers/topics/ucTrade/partitions/6/state}}
>
> We haven't interfered with any of the brokers/zookeepers whilst this happened.
> I've attached a combined log which merges the controller, server and state-change logs from each broker (ids 13, 14 and 15; the log files have the suffixes b13, b14 and b15 respectively).
> We have increased the heaps from 1g to 6g for the brokers and from 512m to 4g for the zookeepers since this happened, but we're not sure if that is relevant. The ZK logs have unfortunately been overwritten, so we can't provide those.
> We produce varying message sizes, and some messages are relatively large (6mb), but we use compression on the producers (set to gzip).
> I've attached some logs from one of our producers as well.
> producer.properties that we've changed:
> spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
> spring.kafka.producer.compression-type=gzip
> spring.kafka.producer.retries=5
> spring.kafka.producer.acks=-1
> spring.kafka.producer.batch-size=1048576
> spring.kafka.producer.properties.linger.ms=200
> spring.kafka.producer.properties.request.timeout.ms=600000
> spring.kafka.producer.properties.max.block.ms=240000
> spring.kafka.producer.properties.max.request.size=104857600

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
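For reference, the ZooKeeper timeout increase described in the comment above would typically be applied in each broker's server.properties. A minimal sketch, assuming the standard Kafka broker property names and the 30-second value the commenter mentions:

```properties
# Broker-side ZooKeeper timeouts, raised from the 6-second default
# to 30 seconds as described in the comment above.
zookeeper.session.timeout.ms=30000
zookeeper.connection.timeout.ms=30000
```

Raising these gives the broker more headroom before its ZooKeeper session expires under GC pauses or I/O stalls, which is one common trigger for the repeated "Cached zkVersion not equal to that in zookeeper" ISR-update failures shown in the logs.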