[ https://issues.apache.org/jira/browse/KAFKA-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Gray updated KAFKA-13335:
------------------------------
    Description: 
After recently upgrading our Connect cluster to 2.8.0 (via Strimzi on Kubernetes; 
the brokers are still on 2.7.0), I am noticing that the cluster is struggling to 
stabilize. Connectors are continuously being unassigned, reassigned, and 
duplicated, and never settle back down. A downgrade back to 2.7.0 fixes things 
immediately. I have attached a picture of our Grafana dashboards showing some of 
the metrics. The Connect cluster has 4 worker nodes and maintains about 1000 
connectors, each with tasks.max set to 1.
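
For reference, each connector is submitted with a configuration along these 
lines (the connector name and class here are illustrative placeholders, not our 
actual ones):

{code}
PUT /connectors/example-source-42/config
{
  "connector.class": "com.example.ExampleSourceConnector",
  "tasks.max": "1"
}
{code}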

We are noticing a slow increase in memory usage, along with large, random spikes 
in the task and thread counts.

While running 2.8.0 I also notice a huge increase in logs stating 
{code}ERROR Graceful stop of task (task name here) failed.{code}, but the logs 
do not indicate a reason. The task appears to be stopped only seconds after its 
creation, and only our source connectors seem to be affected. These logs stop 
after downgrading back to 2.7.0.
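
For context, my reading of the 2.8 sources is that this message comes from the 
worker's task-shutdown path, roughly like the following (a paraphrased sketch, 
not the exact code):

{code:java}
// Paraphrased from org.apache.kafka.connect.runtime.Worker (2.8.x); not exact.
// 'timeout' is task.shutdown.graceful.timeout.ms (default 5000 ms).
private void awaitStopTask(ConnectorTaskId taskId, long timeout) {
    WorkerTask task = tasks.remove(taskId);
    if (task == null) {
        log.warn("Ignoring await stop request for non-present task {}", taskId);
        return;
    }
    if (!task.awaitStop(timeout)) {
        // This is the line producing the errors we are seeing.
        log.error("Graceful stop of task {} failed.", task.id());
        task.cancel();
    }
}
{code}

If that reading is right, the error only means the task did not stop within the 
graceful timeout, which would fit tasks being churned through rapid stop/start 
cycles during rebalances.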

I am also seeing an increase in logs stating:

{code}Couldn't instantiate task (source task name) because it has an invalid task configuration. This task will not execute until reconfigured. (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [StartAndStopExecutor-connect-1-1]
org.apache.kafka.connect.errors.ConnectException: Task already exists in this worker: (source task name)
        at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
        at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
        at org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
        at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
        at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834){code}
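
Judging from the stack trace, the exception comes from a duplicate-start guard 
in Worker.startTask, roughly like this (again paraphrased from the 2.8 sources, 
not exact):

{code:java}
// Paraphrased from org.apache.kafka.connect.runtime.Worker#startTask (2.8.x); not exact.
// 'tasks' is the worker's map of currently running tasks, keyed by ConnectorTaskId.
synchronized (this) {
    if (tasks.containsKey(id))
        throw new ConnectException("Task already exists in this worker: " + id);
    tasks.put(id, workerTask);
}
{code}

So the herder appears to be asking the worker to start a task that was never 
successfully stopped, which would line up with the graceful-stop failures above.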

I am not sure what could be causing this; any insight would be appreciated! 
I do notice that Kafka 2.7.1/2.8.0 contains a bugfix related to Connect 
rebalances (KAFKA-10413). Could that fix be contributing to the instability?
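
One experiment we may try, in case it helps isolate this: forcing the workers 
back onto the eager rebalance protocol to rule incremental cooperative 
rebalancing in or out. This is a diagnostic step rather than a fix, since eager 
rebalancing stops all connectors and tasks on every rebalance:

{code}
# Worker configuration (diagnostic only): fall back to the
# pre-KIP-415 eager rebalance protocol
connect.protocol=eager

# With the default incremental protocol, this controls how long lost
# tasks stay unassigned before being reallocated (default 300000 ms)
scheduled.rebalance.max.delay.ms=300000
{code}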


> Upgrading connect from 2.7.0 to 2.8.0 causes worker instability
> ---------------------------------------------------------------
>
>                 Key: KAFKA-13335
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13335
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: John Gray
>            Priority: Major
>         Attachments: image-2021-09-29-09-15-18-172.png
>
>


