[ https://issues.apache.org/jira/browse/KAFKA-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Gray updated KAFKA-13335:
------------------------------
Description:

After recently upgrading our Connect cluster to 2.8.0 (via Strimzi on Kubernetes; the brokers are still on 2.7.0), the cluster is struggling to stabilize. Connectors are continuously unassigned, reassigned, and duplicated, and never settle back down. A downgrade back to 2.7.0 fixes things immediately. I have attached a picture of our Grafana dashboards showing some of the relevant metrics. The cluster has 4 worker nodes maintaining about 1000 connectors, each connector with tasks.max set to 1.

We are noticing a slow increase in memory usage, along with large random spikes in the task and thread counts.

Over the course of running 2.8.0 I also see a huge increase in logs stating:

{code}
ERROR Graceful stop of task (task name here) failed.
{code}

but the logs do not indicate a reason. The task appears to be stopped only seconds after its creation, and only our source connectors seem to be affected. These messages stop after downgrading back to 2.7.0.

I am also seeing an increase in logs stating:

{code}
Couldn't instantiate task (source task name) because it has an invalid task configuration. This task will not execute until reconfigured. (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [StartAndStopExecutor-connect-1-1]
org.apache.kafka.connect.errors.ConnectException: Task already exists in this worker: (source task name)
	at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
	at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
	at org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
	at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
	at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
{code}

I am not sure what could be causing this; any insight would be appreciated. I do notice that Kafka 2.7.1/2.8.0 contains a bugfix related to Connect rebalances (KAFKA-10413). Could that fix be the source of the instability?
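For context on where the exception originates: judging by the stack trace, Worker.startTask refuses to register a task id that is already present in the worker's map of running tasks. Below is a simplified sketch of that guard, paraphrased from the stack trace rather than copied from the 2.8.0 source; buildWorkerTask is a hypothetical stand-in for the real task construction logic:

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.util.ConnectorTaskId;

public class WorkerSketch {
    // One entry per task currently running on this worker, keyed by task id.
    private final ConcurrentMap<ConnectorTaskId, Runnable> tasks = new ConcurrentHashMap<>();

    public void startTask(ConnectorTaskId id) {
        Runnable workerTask = buildWorkerTask(id); // hypothetical helper for this sketch
        // If a previous instance of the same task is still registered (for
        // example because its graceful stop never completed), the new
        // assignment is rejected with the exception seen in the logs above.
        if (tasks.putIfAbsent(id, workerTask) != null) {
            throw new ConnectException("Task already exists in this worker: " + id);
        }
        // ... hand workerTask off to an executor thread ...
    }

    private Runnable buildWorkerTask(ConnectorTaskId id) {
        return () -> { /* run the connector task */ };
    }
}
{code}

If that reading is right, the two symptoms may be connected: a source task whose graceful stop fails could still occupy its slot on the worker when a follow-up rebalance tries to start the same task there again, which would explain why these errors come in waves while the cluster keeps rebalancing.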
> Upgrading connect from 2.7.0 to 2.8.0 causes worker instability
> ---------------------------------------------------------------
>
>                 Key: KAFKA-13335
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13335
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: John Gray
>            Priority: Major
>         Attachments: image-2021-09-29-09-15-18-172.png
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)