[ 
https://issues.apache.org/jira/browse/KAFKA-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Blumenthal updated KAFKA-8936:
-------------------------------------
    Priority: Minor  (was: Major)

> Connect metrics have a chance to disappear on rebalance
> -------------------------------------------------------
>
>                 Key: KAFKA-8936
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8936
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect, metrics
>    Affects Versions: 2.0.0
>            Reporter: Steven Blumenthal
>            Priority: Minor
>
> We encountered an interesting problem with our Connect cluster. At times, 
> seemingly at random, some Connect sink task metrics would disappear from 
> Datadog (where we send these metrics). After some investigation, I noticed 
> that the metrics in question weren't being reported by the Connect servers 
> themselves.
> After some more investigation, I noticed that the metrics stopped reporting 
> after a rebalance was triggered. Our logs were filled with "Graceful stop of 
> task ... failed". Digging further into the code, it appears that this error 
> means that stopping the tasks timed out for whatever reason, and the Connect 
> cluster will no longer wait for them to stop. They will still stop 
> eventually, but in the meantime new tasks can be spun up. 
> ([Worker.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L587],
>  which calls 
> [WorkerTask.java:cancel()|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L120])
> So new tasks are spun up and begin consuming records and doing work. Then, 
> at some point, the old task is removed, and the very last thing that happens 
> when the old task is removed is that the metric group associated with that 
> task is removed. 
> ([WorkerTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L232]
>  which, in this case, calls 
> [WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L179])
> The issue with this is that task-based metrics are registered under a set 
> of tags that one would expect not to change during runtime. This means that 
> when the old task IS EVENTUALLY REMOVED, it removes the metric group that 
> the new task is using (if the new task came up on the same Connect node that 
> the old task was running on). 
> ([WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L721])
> I tried increasing the "task.shutdown.graceful.timeout.ms" config to 3 times 
> its previous value; however, that did not completely remove the problem. 
> Even if it had, it wouldn't change the fact that a minor network blip on my 
> Connect cluster could result in us needing to redeploy the code simply 
> because metrics went missing due to task shutdowns taking longer than 
> intended.
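
The sequence described in the issue can be sketched in plain Java without any
Kafka dependency. This is only an illustration under stated assumptions, not
the actual Connect code: the Registry and Task classes below are hypothetical
stand-ins, and the sketch assumes that a metric group is identified purely by
its tags, so a replacement task with the same connector name and task id ends
up sharing the group the old task registered. The connector name "my-sink" is
made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration; NOT the real Kafka Connect classes.
class MetricGroupRaceSketch {

    /** Assumption: a metric group is identified only by its tags. */
    static class Registry {
        private final Map<String, Map<String, Double>> groups = new HashMap<>();

        // Returns the existing group for these tags, or creates a new one.
        Map<String, Double> group(Map<String, String> tags) {
            return groups.computeIfAbsent(key(tags), k -> new HashMap<>());
        }

        // Removes whatever group is registered under these tags.
        void removeGroup(Map<String, String> tags) {
            groups.remove(key(tags));
        }

        boolean isReported(Map<String, String> tags) {
            return groups.containsKey(key(tags));
        }

        // Sort the tags so the key does not depend on map iteration order.
        private static String key(Map<String, String> tags) {
            return new TreeMap<>(tags).toString();
        }
    }

    /** A task registers its metrics on start and removes its group on close. */
    static class Task {
        final Registry registry;
        final Map<String, String> tags;

        Task(Registry registry, String connector, int taskId) {
            this.registry = registry;
            this.tags = Map.of("connector", connector, "task", String.valueOf(taskId));
            registry.group(tags).put("sink-record-read-total", 0.0);
        }

        // Last step of shutdown: drop the metric group for this task's tags.
        void close() {
            registry.removeGroup(tags);
        }
    }

    /** Replays the order of events from the issue; returns whether metrics survive. */
    static boolean newTaskMetricsSurvive() {
        Registry registry = new Registry();
        Task oldTask = new Task(registry, "my-sink", 0);
        // Graceful stop of oldTask times out here; the worker stops waiting.
        Task newTask = new Task(registry, "my-sink", 0); // same tags as oldTask
        oldTask.close(); // the old task finally finishes and removes "its" group
        return registry.isReported(newTask.tags);
    }

    public static void main(String[] args) {
        System.out.println("new task metrics still reported: " + newTaskMetricsSurvive());
    }
}
```

In this sketch the late close() of the old task removes the group that the new
task is relying on, so the program prints "new task metrics still reported:
false", which matches the symptom reported above.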



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
