[ https://issues.apache.org/jira/browse/KAFKA-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Blumenthal updated KAFKA-8936:
-------------------------------------
    Priority: Minor  (was: Major)

> Connect metrics have a chance to disappear on rebalance
> -------------------------------------------------------
>
>                 Key: KAFKA-8936
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8936
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect, metrics
>    Affects Versions: 2.0.0
>            Reporter: Steven Blumenthal
>            Priority: Minor
>
> We encountered an interesting problem with our Connect cluster. At
> seemingly random times, some Connect sink task metrics would disappear
> from Datadog (which is where we send these metrics). After some
> investigation, I noticed that the metrics in question weren't being
> reported by the Connect servers themselves.
>
> After some more investigation, I noticed that the metrics stopped
> reporting after a rebalance was triggered. Our logs were filled with
> "Graceful stop of task ... failed". From further digging into what the
> code does in that case, it appears that this error means that stopping
> the tasks timed out for whatever reason, and the Connect cluster will no
> longer wait for them to stop. They will still stop eventually, but in the
> meantime new tasks can be spun up.
> ([Worker.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L587],
> which calls
> [WorkerTask.java:cancel()|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L120])
>
> So, new tasks are being spun up, and begin consuming records and doing
> work. Then, at some point, the old task is removed, and the very last
> thing that happens when the old task is removed is that the metric group
> associated with that task is removed.
> ([WorkerTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L232]
> which, in this case, calls
> [WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L179])
>
> The issue with this is that task-based metrics are registered under a set
> of tags that one would expect not to change at runtime. This means that
> when the old task IS EVENTUALLY REMOVED, it removes the metric group that
> the new task is using (if the new task came up on the same Connect node
> that the old task was running on).
> ([WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L721])
>
> I tried increasing the "task.shutdown.graceful.timeout.ms" config by a
> factor of 3, but that did not completely remove the problem. And even if
> it had, it wouldn't change the fact that a minor network blip on my
> Connect cluster could result in us needing to redeploy the code simply
> because metrics went missing due to task shutdowns taking longer than
> intended.
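
To make the ordering described above concrete, here is a minimal sketch of the race, not Kafka's actual code. The `Registry` and `Task` classes, the connector name "my-sink", and the metric name used are all hypothetical stand-ins; the only assumption taken from the report is that a metric group is identified purely by its tags, so an old and a new task with the same connector name and task id end up sharing one group.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative stand-in for the behaviour in the report; not Kafka's classes. */
class MetricGroupRace {

    /** Hypothetical registry that identifies each metric group by its tags only. */
    static final class Registry {
        private final Map<Map<String, String>, Map<String, Double>> groups = new HashMap<>();

        /** Returns the group registered under these tags, creating it if absent. */
        Map<String, Double> group(Map<String, String> tags) {
            return groups.computeIfAbsent(new TreeMap<>(tags), t -> new HashMap<>());
        }

        /** Removes whatever group is registered under these tags, whoever owns it. */
        void removeGroup(Map<String, String> tags) {
            groups.remove(new TreeMap<>(tags));
        }

        boolean hasGroup(Map<String, String> tags) {
            return groups.containsKey(new TreeMap<>(tags));
        }
    }

    /** Hypothetical task whose metric group is tagged with connector name and task id. */
    static final class Task {
        private final Registry registry;
        private final Map<String, String> tags;

        Task(Registry registry, String connector, int taskId) {
            this.registry = registry;
            this.tags = Map.of("connector", connector, "task", String.valueOf(taskId));
            // Registering a metric creates the group, or reuses an existing one.
            registry.group(tags).put("records-read", 0.0);
        }

        /** Last step of shutdown: drop the metric group registered under this task's tags. */
        void releaseResources() {
            registry.removeGroup(tags);
        }
    }

    /** Replays the ordering from the report; returns whether the new task's group survives. */
    static boolean newTaskGroupSurvives() {
        Registry registry = new Registry();
        Task oldTask = new Task(registry, "my-sink", 0); // running before the rebalance
        // The graceful stop times out, so the replacement is started anyway.
        Task newTask = new Task(registry, "my-sink", 0); // same tags, so same group
        oldTask.releaseResources();                      // old task finally finishes stopping
        return registry.hasGroup(newTask.tags);
    }

    public static void main(String[] args) {
        // prints "new task metrics still registered: false"
        System.out.println("new task metrics still registered: " + newTaskGroupSurvives());
    }
}
```

In this sketch the removal is keyed by tags alone, so the late cleanup cannot tell the old task's group from the new task's. One possible direction, offered only as an assumption and not as the project's fix, would be to have cleanup remove a group only if the task doing the cleanup is still the one that owns it.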