[ https://issues.apache.org/jira/browse/KAFKA-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898172#comment-17898172 ]
Asker commented on KAFKA-17232:
-------------------------------

Hello [~gharris1727],

Thank you very much for responding to our comment; it is highly valuable to our team.

*“Are you seeing the task configuration error appearing continuously without ever resolving?”*

Yes, we are seeing this ERROR log constantly. Last Friday, I upgraded from Kafka 3.6.0 to 3.9.0 and thought the error might be temporary. However, when I returned to work on Monday, the error was still appearing, even though the mirroring itself is functioning: I can see that messages from cluster A to cluster B are appearing in the topics.

*“Are you seeing the log messages indicating loading has finished?”*

No, we did not see such log messages.

*“Also, are you seeing any Scheduler logs mentioning loading initial consumer groups?”*

No, we did not find such logs. Running the following command on the server where the Kafka broker and MirrorMaker 2 service are running returned no results:
{code:bash}
[root@kafka-analytics-3a~]# journalctl -u kafka-mirror-maker.service -xe | grep "loading initial consumer groups"
{code}

*“I also wonder if these log messages could be coming from cancelled tasks by accident.”*

I think you asked a very pertinent question! I’m glad you brought it up. I hope we’re thinking along the same lines, but even if not, it’s still worth discussing.

We have three clusters in our configuration:
{code:bash}
clusters=analytics-dev, app-dev, telemetry-dev
{code}

MirrorMaker is enabled from app-dev to analytics-dev and from telemetry-dev to analytics-dev. However, the "Timeout while loading consumer groups" error always references clientIds for cluster pairs between which MirrorMaker should not be active:

- analytics-dev->app-dev
{code:bash}
[2024-11-11 12:41:44,943] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}

- telemetry-dev->app-dev
{code:bash}
[2024-11-11 12:41:44,497] ERROR [Worker clientId=telemetry-dev->app-dev, groupId=telemetry-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}

- app-dev->telemetry-dev
{code:bash}
[2024-11-11 12:41:44,943] ERROR [Worker clientId=app-dev->telemetry-dev, groupId=app-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
{code}

And so on.

No topics are configured for mirroring in the analytics-dev->app-dev flow, and the same applies to telemetry-dev->app-dev, etc. In other words, MirrorMaker is attempting to interact between clusters where it is not supposed to be active.

Additionally, the app-dev cluster appears in every one of these errors. I’m not sure why this is happening. The only distinguishing feature of this cluster is that it has ACLs, but analytics-dev also has ACLs.

Our connect-mirror-maker.properties file looks like this:
{code:bash}
clusters=analytics-dev, app-dev, telemetry-dev

# Analytics-dev cluster configuration
analytics-dev.bootstrap.servers=kafka-analytics-1a:9092, kafka-analytics-2a:9092, kafka-analytics-3a:9092
analytics-dev.security.protocol=...
analytics-dev.sasl.mechanism=...
analytics-dev.sasl.jaas.config=...
analytics-dev.checkpoints.topic.replication.factor=2
analytics-dev.heartbeats.topic.replication.factor=2
analytics-dev.offset-syncs.topic.replication.factor=2
analytics-dev.offset.storage.replication.factor=2
analytics-dev.status.storage.replication.factor=2
analytics-dev.config.storage.replication.factor=2

# App-dev cluster configuration
app-dev.bootstrap.servers=kafka-app-1a:9092, kafka-app-2a:9092, kafka-app-3a:9092
app-dev.security.protocol=...
app-dev.sasl.mechanism=...
app-dev.sasl.jaas.config=...
app-dev.checkpoints.topic.replication.factor=2
app-dev.heartbeats.topic.replication.factor=2
app-dev.offset-syncs.topic.replication.factor=2
app-dev.offset.storage.replication.factor=2
app-dev.status.storage.replication.factor=2
app-dev.config.storage.replication.factor=2

# Telemetry-dev cluster configuration
telemetry-dev.bootstrap.servers=kafka-telemetry-1a:9092
telemetry-dev.security.protocol=...
telemetry-dev.checkpoints.topic.replication.factor=1
telemetry-dev.heartbeats.topic.replication.factor=1
telemetry-dev.offset-syncs.topic.replication.factor=1
telemetry-dev.offset.storage.replication.factor=1
telemetry-dev.status.storage.replication.factor=1
telemetry-dev.config.storage.replication.factor=1

# Replication flows
app-dev->analytics-dev.enabled=true
app-dev->analytics-dev.topics=...
analytics-dev->app-dev.enabled=false

telemetry-dev->analytics-dev.enabled=true
telemetry-dev->analytics-dev.topics=...
analytics-dev->telemetry-dev.enabled=false

replication.factor=2
replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy
dedicated.mode.enable.internal.rest=true
num.streams=3
tasks.max=2
{code}

We are eagerly awaiting your response!

Best regards,
Asker Kakhramanov

> MirrorCheckpointConnector does not generate task configs if initial consumer
> group load times out
> -------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17232
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17232
>             Project: Kafka
>          Issue Type: Bug
>          Components: mirrormaker
>    Affects Versions: 3.9.0
>            Reporter: Greg Harris
>            Assignee: TengYao Chi
>            Priority: Major
>             Fix For: 3.9.0
>
>
> The MirrorCheckpointConnector has two operations that read the source
> consumer groups:
> * loadInitialConsumerGroups
> * refreshConsumerGroups
> loadInitialConsumerGroups blocks the start() method of the connector, while
> refreshConsumerGroups is asynchronous and runs periodically while the
> connector is running.
> loadInitialConsumerGroups may take a long time to execute, and may exceed the
> configured "admin.timeout.ms" used by the Scheduler. This timeout is logged
> and the start() method returns normally. If this happens, the framework will
> generate task configs immediately after start(), before
> loadInitialConsumerGroups can finish, and will generate an empty set of task
> configs:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L118-L121].
> Later, when loadInitialConsumerGroups completes, it will not request task
> reconfiguration, believing it is the initial load operation.
> Later still, when refreshConsumerGroups completes, it will not request task
> reconfiguration, as the set of consumer groups has not changed since the
> initial load:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L173-L180]
>
> This leads to a situation where the MirrorCheckpointConnector believes it has
> converged with nothing to update, but actually has consumer groups that are
> not allocated to tasks.
> This happens particularly for large, stable Kafka clusters with many consumer
> groups that are not being actively created or deleted.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)