[
https://issues.apache.org/jira/browse/KAFKA-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767115#comment-17767115
]
Steve Jacobs commented on KAFKA-15467:
--------------------------------------
This appears to coincide with our upgrade from MM2 version 3.4.1 to version
3.5.1. I've reverted to 3.4.1 to see if the problem still occurs.
> On reconnect mm2 cannot find offset that exists in remote cluster and
> re-syncs the entire topic
> -----------------------------------------------------------------------------------------------
>
> Key: KAFKA-15467
> URL: https://issues.apache.org/jira/browse/KAFKA-15467
> Project: Kafka
> Issue Type: Bug
> Components: mirrormaker
> Affects Versions: 3.5.1
> Reporter: Steve Jacobs
> Priority: Major
>
> I'm encountering an odd issue with MirrorMaker 2 in our remote replication
> setup to high-latency remote sites (satellite links).
> Every few days several topics get completely re-replicated, and this appears
> to happen after a network connectivity outage. It doesn't matter whether it's
> a long outage (hours) or a short one (minutes), and it only seems to affect a
> few topics.
> I was finally able to track down some logs showing the issue. This was after
> a roughly hour-long outage where connectivity went down. There were lots of
> logs about connection timeouts, etc. Here is the relevant part from when the
> connection came back up:
> {code:java}
> 2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker] [AdminClient clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin] Disconnecting from node 0 due to socket connection setup timeout. The timeout value is 63245 ms. (org.apache.kafka.clients.NetworkClient) [kafka-admin-client-thread | mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> 2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker] [AdminClient clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin] Metadata update failed (org.apache.kafka.clients.admin.internals.AdminMetadataManager) [kafka-admin-client-thread | mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> 2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Disconnecting from node 0 due to socket connection setup timeout. The timeout value is 52624 ms. (org.apache.kafka.clients.NetworkClient) [task-thread-scbi->gcp.MirrorSourceConnector-1]
> 2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Error sending fetch request (sessionId=460667411, epoch=INITIAL) to node 0: (org.apache.kafka.clients.FetchSessionHandler) [task-thread-scbi->gcp.MirrorSourceConnector-1]
> 2023-09-08 16:52:47,336 INFO [scbi->gcp.MirrorSourceConnector|worker] refreshing topics took 67359 ms (org.apache.kafka.connect.mirror.Scheduler) [Scheduler for MirrorSourceConnector: scbi->gcp|scbi->gcp.MirrorSourceConnector-refreshing topics]
> 2023-09-08 16:52:48,413 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Fetch position FetchPosition{offset=4918131, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094 (id: 0 rack: null)], epoch=0}} is out of range for partition reading.sensor.hfp01sc-0, resetting offset (org.apache.kafka.clients.consumer.internals.AbstractFetch) [task-thread-scbi->gcp.MirrorSourceConnector-1]
> (Repeats for 11 more topics)
> 2023-09-08 16:52:48,479 INFO [scbi->gcp.MirrorSourceConnector|task-1] [Consumer clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer, groupId=null] Resetting offset for partition reading.sensor.hfp01sc-0 to position FetchPosition{offset=3444977, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094 (id: 0 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState) [task-thread-scbi->gcp.MirrorSourceConnector-1]
> (Repeats for 11 more topics) {code}
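>
> For context on what the "out of range ... resetting offset" pair of log lines
> means: when the broker answers a fetch with OFFSET_OUT_OF_RANGE, the consumer
> falls back to its offset reset policy and seeks to whatever position that
> policy returns; here it lands on 3444977, which matches the earliest retained
> offset, so the policy in effect is evidently "earliest". A rough way to
> re-check the exact fetch position from outside MM2 is a bare consumer with
> manual assignment. This is only a sketch: the bootstrap address is the leader
> shown in the log above, and whatever SSL/SASL settings the cluster needs are
> omitted.
> {code:java}
> import java.time.Duration;
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.ConsumerRecords;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.TopicPartition;
> import org.apache.kafka.common.serialization.ByteArrayDeserializer;
>
> public class SeekCheck {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         // Placeholder bootstrap address (taken from the log); add security settings as needed.
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.scbi.eng.neoninternal.org:9094");
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
>         // Fail with OffsetOutOfRangeException instead of silently resetting.
>         props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
>         // No group.id and manual assignment only, like the MM2 source consumer (groupId=null above).
>
>         TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
>         try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>             consumer.assign(Collections.singletonList(tp));
>             consumer.seek(tp, 4918131L); // the fetch position MM2 reported as out of range
>             ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(10));
>             records.forEach(r -> System.out.printf("offset=%d timestamp=%d%n", r.offset(), r.timestamp()));
>         }
>     }
> }
> {code}
> If that poll returns records starting at 4918131 even though MM2 had just been
> told the position was out of range, the reset looks like a transient
> broker-side answer during the reconnect rather than genuinely missing data.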
> The consumer reports that offset 4918131 is out of range for this
> topic/partition, but that offset still exists on the remote cluster; I can go
> pull it up with a consumer right now. The earliest offset in that topic that
> still exists is 3444977 as of yesterday. We have 30-day retention configured,
> so pulling in 30 days of duplicate data is very bad. It almost seems like a
> race condition, as we replicate 38 topics but only 12 were affected (on this
> occurrence).
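>
> For completeness, the retained offset range on the source cluster can also be
> confirmed with the AdminClient. A minimal sketch along the same lines (same
> placeholder bootstrap address, security settings omitted) that prints the
> earliest and latest offsets for the partition and whether the position MM2
> was fetching still falls inside that range:
> {code:java}
> import java.util.Map;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.OffsetSpec;
> import org.apache.kafka.common.TopicPartition;
>
> public class CheckOffsets {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         // Placeholder bootstrap address for the source (scbi) cluster.
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.scbi.eng.neoninternal.org:9094");
>
>         TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
>         try (Admin admin = Admin.create(props)) {
>             long earliest = admin.listOffsets(Map.of(tp, OffsetSpec.earliest()))
>                     .partitionResult(tp).get().offset();
>             long latest = admin.listOffsets(Map.of(tp, OffsetSpec.latest()))
>                     .partitionResult(tp).get().offset();
>             long mm2Position = 4918131L; // position MM2 was fetching when it got the out-of-range reset
>             System.out.printf("earliest=%d latest=%d mm2Position=%d stillRetained=%b%n",
>                     earliest, latest, mm2Position,
>                     mm2Position >= earliest && mm2Position < latest);
>         }
>     }
> }
> {code}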
>
> The number of topics affected seems to vary each time. Today I see one site
> resending 2 topics, and another resending 13.