[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077128#comment-17077128 ]
Nitay Kufert commented on KAFKA-9824:
-------------------------------------

[~hachikuji] I can see the logs you suggested searching for, but not around the time we lost the partition offset (the logs from today, 7/4, are from the reassignment currently in progress):
{code:java}
Apr 7, 2020 @ 13:37:52.877 INFO trackingSolutionAttribution-48 starts at leader epoch 40 from offset 1247805916 with high watermark 1247805834. Previous leader epoch was 38.
Apr 7, 2020 @ 13:37:52.877 INFO trackingSolutionAttribution-48 starts at leader epoch 40 from offset 1247805916 with high watermark 1247805834. Previous leader epoch was 38.
Apr 7, 2020 @ 13:37:52.851 INFO trackingSolutionAttribution-48 starts at leader epoch 38 from offset 1247805916 with high watermark 1247805834. Previous leader epoch was 37.
Apr 7, 2020 @ 13:37:52.851 INFO trackingSolutionAttribution-48 starts at leader epoch 38 from offset 1247805916 with high watermark 1247805834. Previous leader epoch was 37.
Mar 22, 2020 @ 22:46:46.217 INFO trackingSolutionAttribution-48 starts at leader epoch 34 from offset 1053408212 with high watermark 1053408211. Previous leader epoch was 33.
Mar 16, 2020 @ 17:04:33.396 INFO trackingSolutionAttribution-48 starts at leader epoch 32 from offset 982961807 with high watermark 982961805. Previous leader epoch was 31.
Mar 16, 2020 @ 15:38:22.062 INFO trackingSolutionAttribution-48 starts at leader epoch 29 from offset 982210445 with high watermark 982210444. Previous leader epoch was 28.
Mar 16, 2020 @ 14:02:20.613 INFO trackingSolutionAttribution-48 starts at leader epoch 26 from offset 981481489 with high watermark 981481489. Previous leader epoch was 25.
Mar 16, 2020 @ 13:59:50.768 INFO trackingSolutionAttribution-48 starts at leader epoch 24 from offset 981464695 with high watermark 981464695. Previous leader epoch was 23.
Mar 10, 2020 @ 04:32:35.230 INFO trackingSolutionAttribution-48 starts at Leader Epoch 21 from offset 924450729. Previous Leader Epoch was: 20
Mar 5, 2020 @ 18:47:32.793 INFO trackingSolutionAttribution-48 starts at Leader Epoch 19 from offset 882883160. Previous Leader Epoch was: 18
Feb 26, 2020 @ 04:54:26.071 INFO trackingSolutionAttribution-48 starts at Leader Epoch 17 from offset 807789330. Previous Leader Epoch was: 16
Feb 24, 2020 @ 05:38:49.292 INFO trackingSolutionAttribution-48 starts at Leader Epoch 15 from offset 790179337. Previous Leader Epoch was: 14
Feb 21, 2020 @ 03:06:26.145 INFO trackingSolutionAttribution-48 starts at Leader Epoch 13 from offset 758461732. Previous Leader Epoch was: 12
Feb 21, 2020 @ 02:15:12.120 INFO trackingSolutionAttribution-48 starts at Leader Epoch 11 from offset 758198824. Previous Leader Epoch was: 10
Feb 20, 2020 @ 11:38:21.393 INFO trackingSolutionAttribution-48 starts at Leader Epoch 9 from offset 752524221. Previous Leader Epoch was: 8
Feb 20, 2020 @ 07:54:42.183 INFO trackingSolutionAttribution-48 starts at Leader Epoch 7 from offset 751507480. Previous Leader Epoch was: 6
Feb 6, 2020 @ 23:19:32.178 INFO trackingSolutionAttribution-48 starts at Leader Epoch 5 from offset 630170838. Previous Leader Epoch was: 4
{code}
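Each "starts at leader epoch N" entry above marks a leadership change on the partition, and the duplicated Apr 7 entries line up with the reassignment mentioned above. For anyone wanting to confirm what the controller still has in flight while such epoch bumps occur, here is a minimal sketch using the AdminClient's listPartitionReassignments API (available since Kafka 2.4); the bootstrap address is a placeholder, not taken from this report:
{code:java}
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignmentCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address - not from the original report.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Every reassignment the controller currently has in flight.
            Map<TopicPartition, PartitionReassignment> inFlight =
                admin.listPartitionReassignments().reassignments().get();
            inFlight.forEach((tp, r) ->
                System.out.printf("%s: replicas=%s adding=%s removing=%s%n",
                    tp, r.replicas(), r.addingReplicas(), r.removingReplicas()));
        }
    }
}
{code}
The broker-side truncations that pair with the epoch bumps follow: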
{code:java}
Apr 7, 2020 @ 13:37:55.391 INFO Truncating to offset 1247805916 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme1n1
Apr 7, 2020 @ 13:37:55.391 INFO Truncating to offset 1247805916 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme1n1
Mar 22, 2020 @ 22:49:39.831 INFO Truncating to offset 1053436907 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Mar 16, 2020 @ 17:08:06.692 INFO Truncating to offset 982992666 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Mar 16, 2020 @ 15:43:38.562 INFO Truncating to offset 982253205 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Mar 16, 2020 @ 14:06:37.653 INFO Truncating to offset 981512930 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Mar 10, 2020 @ 04:33:49.030 INFO Truncating to offset 924457256 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Mar 5, 2020 @ 18:48:46.584 INFO Truncating to offset 882891300 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Feb 26, 2020 @ 04:57:57.480 INFO Truncating to offset 807809547 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Feb 24, 2020 @ 05:40:37.387 INFO Truncating to offset 790188884 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Feb 21, 2020 @ 03:10:37.631 INFO Truncating to offset 758484243 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Feb 21, 2020 @ 02:15:34.822 INFO Truncating to offset 758200579 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
Feb 20, 2020 @ 11:39:55.060 INFO Truncating to offset 752531954 Log partition=trackingSolutionAttribution-48, dir=/var/lib/kafka/nvme0n1
{code}
+*It seems the reassignment didn't stop this log from being printed over and over again.*+
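Truncations like these can leave a consumer group's committed offset outside the log's valid range, which the consumer then reports as "Fetch offset ... is out of range" (quoted in the issue below). As a rough way to catch that condition before a silent reset, here is a sketch that compares the group's committed offsets against the brokers' current range; the bootstrap address is a placeholder, the group name is the one from this report:
{code:java}
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class OffsetRangeCheck {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092"; // placeholder
        String group = "fireAttributionConsumerGroup4";

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", bootstrap);
        consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (Admin admin = Admin.create(adminProps);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
            // Committed offsets as the group coordinator sees them.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(group).partitionsToOffsetAndMetadata().get();
            Set<TopicPartition> partitions = committed.keySet();
            // Valid offset range on the brokers right now.
            Map<TopicPartition, Long> start = consumer.beginningOffsets(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            committed.forEach((tp, om) -> {
                if (om == null) return; // partition with no committed offset
                long offset = om.offset();
                if (offset < start.get(tp) || offset > end.get(tp)) {
                    System.out.printf("ALERT %s: committed %d outside [%d, %d]%n",
                        tp, offset, start.get(tp), end.get(tp));
                }
            });
        }
    }
}
{code}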
> Consumer loses partition offset and resets post 2.4.1 version upgrade
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-9824
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9824
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Nitay Kufert
>            Priority: Major
>        Attachments: image-2020-04-06-13-14-47-014.png
>
> Hello,
> Around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1), and we started noticing a troubling behavior that we hadn't seen before:
>
> For no apparent reason, a specific partition on a specific consumer loses its offset and starts re-consuming the entire partition from the beginning (according to the retention).
>
> Messages appearing on the consumer (client):
> {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527.
> {quote}
> {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range for partition trackingSolutionAttribution-48
> {quote}
> These are the logs from the brokers at the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4"):
> {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
>
> Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
>
> Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 0 ms.
>
> Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups
>
> Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups
> {quote}
> Another way to see the same thing, from our monitoring (DD) of the partition offset:
> !image-2020-04-06-13-14-47-014.png|width=530,height=152!
> The recovery you see here happened after I reset the partition offset manually (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE')
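For reference, a programmatic equivalent of that CLI reset is a "look up the offset for a timestamp, then commit it" sequence. This is only a sketch under assumptions: the group must have no active members while it runs (the CLI tool has the same requirement), and the target date is hypothetical, standing in for 'SOME DATE' above:
{code:java}
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ResetToDatetime {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "fireAttributionConsumerGroup4");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("trackingSolutionAttribution", 57);
        // Hypothetical timestamp, standing in for 'SOME DATE'.
        long targetMs = Instant.parse("2020-04-05T00:00:00Z").toEpochMilli();

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // Earliest offset whose timestamp is >= the target time.
            Map<TopicPartition, OffsetAndTimestamp> found =
                consumer.offsetsForTimes(Collections.singletonMap(tp, targetMs));
            OffsetAndTimestamp oat = found.get(tp);
            if (oat != null) {
                Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
                toCommit.put(tp, new OffsetAndMetadata(oat.offset()));
                consumer.commitSync(toCommit); // overwrites the group's committed offset
            }
        }
    }
}
{code}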
> Any idea what can be causing this? It has happened to us at least 5 times since the upgrade; before that, I don't remember it ever happening to us.
>
> Topic config is set to default, except the retention, which is manually set to 43200000 (ms, i.e. 12 hours).
> The topic has 60 partitions & a replication factor of 2.
>
> Consumer config:
> {code:java}
> ConsumerConfig values:
> allow.auto.create.topics = true
> auto.commit.interval.ms = 5000
> auto.offset.reset = earliest
> bootstrap.servers = [..........]
> check.crcs = true
> client.dns.lookup = default
> client.id =
> client.rack =
> connections.max.idle.ms = 540000
> default.api.timeout.ms = 60000
> enable.auto.commit = true
> exclude.internal.topics = true
> fetch.max.bytes = 52428800
> fetch.max.wait.ms = 500
> fetch.min.bytes = 1
> group.id = fireAttributionConsumerGroup4
> group.instance.id = null
> heartbeat.interval.ms = 10000
> interceptor.classes = []
> internal.leave.group.on.close = true
> isolation.level = read_uncommitted
> key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
> max.partition.fetch.bytes = 1048576
> max.poll.interval.ms = 300000
> max.poll.records = 500
> metadata.max.age.ms = 300000
> metric.reporters = []
> metrics.num.samples = 2
> metrics.recording.level = INFO
> metrics.sample.window.ms = 30000
> partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
> receive.buffer.bytes = 65536
> reconnect.backoff.max.ms = 1000
> reconnect.backoff.ms = 50
> request.timeout.ms = 30000
> retry.backoff.ms = 100
> sasl.client.callback.handler.class = null
> sasl.jaas.config = null
> sasl.kerberos.kinit.cmd = /usr/bin/kinit
> sasl.kerberos.min.time.before.relogin = 60000
> sasl.kerberos.service.name = null
> sasl.kerberos.ticket.renew.jitter = 0.05
> sasl.kerberos.ticket.renew.window.factor = 0.8
> sasl.login.callback.handler.class = null
> sasl.login.class = null
> sasl.login.refresh.buffer.seconds = 300
> sasl.login.refresh.min.period.seconds = 60
> sasl.login.refresh.window.factor = 0.8
> sasl.login.refresh.window.jitter = 0.05
> sasl.mechanism = GSSAPI
> security.protocol = PLAINTEXT
> security.providers = null
> send.buffer.bytes = 131072
> session.timeout.ms = 30000
> ssl.cipher.suites = null
> ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
> ssl.endpoint.identification.algorithm = https
> ssl.key.password = null
> ssl.keymanager.algorithm = SunX509
> ssl.keystore.location = null
> ssl.keystore.password = null
> ssl.keystore.type = JKS
> ssl.protocol = TLS
> ssl.provider = null
> ssl.secure.random.implementation = null
> ssl.trustmanager.algorithm = PKIX
> ssl.truststore.location = null
> ssl.truststore.password = null
> ssl.truststore.type = JKS
> value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
> {code}
>
> Broker config (server.properties) - we have 9 brokers:
> {code:java}
> num.io.threads=4
> num.network.threads=64
> socket.send.buffer.bytes=33554432
> socket.receive.buffer.bytes=33554432
> socket.request.max.bytes=104857600
> num.partitions=2
> log.retention.hours=1
> log.segment.bytes=536870912
> log.retention.check.interval.ms=60000
> zookeeper.connection.timeout.ms=1000000
> controlled.shutdown.enable=true
> auto.leader.rebalance.enable=true
> log.cleaner.enable=true
> log.cleaner.min.cleanable.ratio=0.1
> log.cleaner.threads=2
> log.cleanup.policy=delete
> log.cleaner.delete.retention.ms=86400000
> log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
> num.recovery.threads.per.data.dir=1
> log.flush.interval.messages=9223372036854775807
> message.max.bytes=10000000
> replica.fetch.max.bytes=10000000
> default.replication.factor=2
> delete.topic.enable=true
> unclean.leader.election.enable=false
> compression.type=snappy
> inter.broker.protocol.version=2.4
> {code}
>
> Let me know if you need more data from me.
>
> Thanks
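One observation on the consumer config above: with auto.offset.reset = earliest, an out-of-range fetch is silently answered by rewinding to the earliest available offset, which is exactly the full re-consumption described. A minimal sketch of the alternative — auto.offset.reset = none plus explicit handling of OffsetOutOfRangeException, so the condition raises an alert instead of a rewind; the bootstrap address is a placeholder:
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NoSilentReset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "fireAttributionConsumerGroup4");
        // "none" turns an invalid committed offset into an error instead of a rewind.
        props.put("auto.offset.reset", "none");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<byte[], String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("trackingSolutionAttribution"));
            while (true) {
                try {
                    ConsumerRecords<byte[], String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<byte[], String> record : records) {
                        // ... process record ...
                    }
                } catch (OffsetOutOfRangeException e) {
                    // The committed offset fell outside the log's valid range:
                    // alert and decide deliberately instead of re-reading everything.
                    System.err.println("Offset out of range for " + e.partitions());
                    throw e;
                }
            }
        }
    }
}
{code}
Whether that trade-off is acceptable depends on the pipeline; "none" turns a silent rewind into a hard failure that needs operator attention.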