[ 
https://issues.apache.org/jira/browse/KAFKA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryanne Dolan updated KAFKA-12726:
---------------------------------
    Description: 
We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck 
in a retry loop). Although Connect supports the property 
task.shutdown.graceful.timeout.ms, the timeout is currently not enforced – 
tasks can take as long as they want to stop, and the only consequence is an 
error message.

We've seen a Worker's "task-count" metric double following a rebalance, which 
we think is due to Tasks not getting cleaned up when Task.stop() is stuck.

While the Connector implementation is ultimately to blame here – a Task 
probably shouldn't loop forever in stop() – we believe the Connect runtime 
should handle this situation more gracefully.

  was:
We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck 
in a retry loop). Despite Connect supporting a property 
task.shutdown.graceful.timeout.ms, this is currently not enforced -- tasks can 
take as long as they want to stop, and the only consequence is an error message.

Unfortunately, Workers stop Tasks sequentially, meaning that a stuck Task can 
prevent any further Tasks from stopping. Moreover, after a rebalance, these 
lingering tasks can persist along with their replacements. For example, we've 
seen a Worker's "task-count" metric double following a rebalance.

While the Connector implementation is ultimately to blame here -- a Task 
probably shouldn't loop forever in stop() -- we believe the Connect runtime 
should handle this situation more gracefully.


> misbehaving Task.stop() can prevent other Tasks from stopping
> -------------------------------------------------------------
>
>                 Key: KAFKA-12726
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12726
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: Ryanne Dolan
>            Assignee: Ryanne Dolan
>            Priority: Minor
>
> We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck 
> in a retry loop). Although Connect supports the property 
> task.shutdown.graceful.timeout.ms, the timeout is currently not enforced – 
> tasks can take as long as they want to stop, and the only consequence is an 
> error message.
> We've seen a Worker's "task-count" metric double following a rebalance, which 
> we think is due to Tasks not getting cleaned up when Task.stop() is stuck.
> While the Connector implementation is ultimately to blame here – a Task 
> probably shouldn't loop forever in stop() – we believe the Connect runtime 
> should handle this situation more gracefully.
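
For illustration only, here is a minimal sketch of what enforcing a stop
timeout could look like. It is not the Connect runtime's code and not the fix
for this issue. The class BoundedStop and the method stopWithTimeout are
hypothetical names; the timeout passed in is assumed to be a value such as the
one configured via task.shutdown.graceful.timeout.ms. The stop action runs on
its own daemon thread, and the caller waits for it only up to the timeout.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of the Kafka Connect API. */
final class BoundedStop {

    /**
     * Runs stopAction on a separate daemon thread and waits at most timeoutMs.
     * Returns true if the action completed in time, false if it timed out,
     * threw an exception, or the calling thread was interrupted.
     */
    static boolean stopWithTimeout(Runnable stopAction, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "task-stop");
            t.setDaemon(true); // a stuck stop() should not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(stopAction);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupts the worker thread. A stop() that ignores interrupts
            // keeps running, but the caller is no longer blocked by it.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the stop action itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow(); // does not wait for the worker thread
        }
    }

    public static void main(String[] args) {
        // A well-behaved stop action returns immediately.
        boolean clean = stopWithTimeout(() -> { }, 1_000L);

        // A misbehaving stop action, simulated here with a long sleep.
        boolean stuck = stopWithTimeout(() -> {
            try {
                Thread.sleep(60_000L);
            } catch (InterruptedException e) {
                // give up once interrupted
            }
        }, 100L);

        System.out.println("clean stop finished in time: " + clean);
        System.out.println("stuck stop finished in time: " + stuck);
    }
}
```

Calling stopWithTimeout once per task means a stuck stop() delays the caller by
at most the timeout. It does not guarantee that the task's thread or resources
are actually released, so any real fix would still need to decide how to clean
up or account for tasks that never finish stopping.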



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
