[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082252#comment-17082252 ]
Nitay Kufert commented on KAFKA-9824:
-------------------------------------

[~hachikuji] hey, sorry for the late response.

So yeah, the dumps don't exist anymore... if and when it happens again, I'll try to capture the requested information.

For now, we changed auto.offset.reset to none so we can handle the reset manually (or it may even resolve itself after the app restarts); a minimal sketch of that approach appears below, after the issue description. Does that make sense? As far as I understand, the offset reset happens when the leader changes while one of the consumers is ahead of the new leader (at least, that's one of the reasons it happens).

> Consumer loses partition offset and resets post 2.4.1 version upgrade
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-9824
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9824
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Nitay Kufert
>            Priority: Major
>         Attachments: image-2020-04-06-13-14-47-014.png
>
>
> Hello,
> Around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1), and we started noticing troubling behavior that we hadn't seen before:
>
> For no apparent reason, a specific partition on a specific consumer loses its offset and starts re-consuming the entire partition from the beginning (according to the retention).
>
> Messages appearing on the consumer (client):
> {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527.
> {quote}
> {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range for partition trackingSolutionAttribution-48
> {quote}
> These are the logs from the brokers at the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4"):
> {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
>
> Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
>
> Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 0 ms.
>
> Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups
>
> Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups
> {quote}
> Another way to see the same thing, from our monitoring (DD) on the partition offset:
> !image-2020-04-06-13-14-47-014.png|width=530,height=152!
> The recovery you are seeing is from after I ran a partition offset reset manually (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE').
>
> Any idea what could be causing this? It has happened to us at least 5 times since the upgrade, and before that I don't remember it ever happening to us.
>
> The topic config is set to default, except the retention, which is manually set to 43200000 ms (12 hours).
> The topic has 60 partitions & a replication factor of 2.
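
A minimal sketch of the "auto.offset.reset=none, handle it manually" mitigation mentioned in the comment above. The bootstrap address, subscription, and seek-to-end recovery policy here are illustrative assumptions, not the reporter's actual code; with "none", poll() throws instead of silently rewinding when the fetch offset is missing or out of range:

{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.InvalidOffsetException;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualOffsetResetConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
        // "none": poll() throws instead of jumping to earliest/latest when the
        // committed offset is missing or out of range.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<byte[], String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("trackingSolutionAttribution"));
            while (true) {
                try {
                    ConsumerRecords<byte[], String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(record -> {
                        // normal processing goes here
                    });
                } catch (InvalidOffsetException e) {
                    // Covers both OffsetOutOfRangeException and NoOffsetForPartitionException.
                    // Alert and pick a recovery policy explicitly instead of re-reading the
                    // whole partition; seeking to the end is just one possible choice (assumed here).
                    System.err.println("Invalid offset for " + e.partitions() + ", seeking to end");
                    consumer.seekToEnd(e.partitions());
                }
            }
        }
    }
}
{code}

The point of this pattern is simply that the application gets to alert and choose a recovery policy, rather than re-consuming the whole partition the way auto.offset.reset=earliest does.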
>
> Consumer config:
> {code:java}
> ConsumerConfig values:
> 	allow.auto.create.topics = true
> 	auto.commit.interval.ms = 5000
> 	auto.offset.reset = earliest
> 	bootstrap.servers = [..........]
> 	check.crcs = true
> 	client.dns.lookup = default
> 	client.id =
> 	client.rack =
> 	connections.max.idle.ms = 540000
> 	default.api.timeout.ms = 60000
> 	enable.auto.commit = true
> 	exclude.internal.topics = true
> 	fetch.max.bytes = 52428800
> 	fetch.max.wait.ms = 500
> 	fetch.min.bytes = 1
> 	group.id = fireAttributionConsumerGroup4
> 	group.instance.id = null
> 	heartbeat.interval.ms = 10000
> 	interceptor.classes = []
> 	internal.leave.group.on.close = true
> 	isolation.level = read_uncommitted
> 	key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
> 	max.partition.fetch.bytes = 1048576
> 	max.poll.interval.ms = 300000
> 	max.poll.records = 500
> 	metadata.max.age.ms = 300000
> 	metric.reporters = []
> 	metrics.num.samples = 2
> 	metrics.recording.level = INFO
> 	metrics.sample.window.ms = 30000
> 	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
> 	receive.buffer.bytes = 65536
> 	reconnect.backoff.max.ms = 1000
> 	reconnect.backoff.ms = 50
> 	request.timeout.ms = 30000
> 	retry.backoff.ms = 100
> 	sasl.client.callback.handler.class = null
> 	sasl.jaas.config = null
> 	sasl.kerberos.kinit.cmd = /usr/bin/kinit
> 	sasl.kerberos.min.time.before.relogin = 60000
> 	sasl.kerberos.service.name = null
> 	sasl.kerberos.ticket.renew.jitter = 0.05
> 	sasl.kerberos.ticket.renew.window.factor = 0.8
> 	sasl.login.callback.handler.class = null
> 	sasl.login.class = null
> 	sasl.login.refresh.buffer.seconds = 300
> 	sasl.login.refresh.min.period.seconds = 60
> 	sasl.login.refresh.window.factor = 0.8
> 	sasl.login.refresh.window.jitter = 0.05
> 	sasl.mechanism = GSSAPI
> 	security.protocol = PLAINTEXT
> 	security.providers = null
> 	send.buffer.bytes = 131072
> 	session.timeout.ms = 30000
> 	ssl.cipher.suites = null
> 	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
> 	ssl.endpoint.identification.algorithm = https
> 	ssl.key.password = null
> 	ssl.keymanager.algorithm = SunX509
> 	ssl.keystore.location = null
> 	ssl.keystore.password = null
> 	ssl.keystore.type = JKS
> 	ssl.protocol = TLS
> 	ssl.provider = null
> 	ssl.secure.random.implementation = null
> 	ssl.trustmanager.algorithm = PKIX
> 	ssl.truststore.location = null
> 	ssl.truststore.password = null
> 	ssl.truststore.type = JKS
> 	value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
> {code}
>
> Broker config (server.properties) - we have 9 brokers:
> {code:java}
> num.io.threads=4
> num.network.threads=64
> socket.send.buffer.bytes=33554432
> socket.receive.buffer.bytes=33554432
> socket.request.max.bytes=104857600
> num.partitions=2
> log.retention.hours=1
> log.segment.bytes=536870912
> log.retention.check.interval.ms=60000
> zookeeper.connection.timeout.ms=1000000
> controlled.shutdown.enable=true
> auto.leader.rebalance.enable=true
> log.cleaner.enable=true
> log.cleaner.min.cleanable.ratio=0.1
> log.cleaner.threads=2
> log.cleanup.policy=delete
> log.cleaner.delete.retention.ms=86400000
> log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
> num.recovery.threads.per.data.dir=1
> log.flush.interval.messages=9223372036854775807
> message.max.bytes=10000000
> replica.fetch.max.bytes=10000000
> default.replication.factor=2
> delete.topic.enable=true
> unclean.leader.election.enable=false
> compression.type=snappy
> inter.broker.protocol.version=2.4
> {code}
>
> Let me know if you need more data from me
>
> Thanks
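
For reference, the manual recovery described in the report (the kafka-consumer-groups.sh --reset-offsets --to-datetime run) can also be done programmatically. Below is a minimal sketch, assuming a placeholder bootstrap address and timestamp (the report elides the actual date as 'SOME DATE'); it uses KafkaConsumer.offsetsForTimes to translate the timestamp into an offset and commitSync to store it for the group, which, like the CLI tool, requires the consumer group to be inactive:

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ResetGroupOffsetToDatetime {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        // Partition 57, matching the CLI example in the report.
        TopicPartition tp = new TopicPartition("trackingSolutionAttribution", 57);
        long targetTs = 1586088000000L; // placeholder epoch millis for 'SOME DATE'

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Find the earliest offset whose record timestamp is >= targetTs.
            Map<TopicPartition, OffsetAndTimestamp> found =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, targetTs));
            OffsetAndTimestamp oat = found.get(tp);
            if (oat == null) {
                throw new IllegalStateException("No offset at or after " + targetTs + " for " + tp);
            }
            // Commit that offset for the group. As with the CLI tool, this only
            // works while the group is inactive (all members stopped).
            consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(oat.offset())));
            System.out.println("Reset " + tp + " to offset " + oat.offset());
        }
    }
}
{code}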