[jira] [Comment Edited] (KAFKA-9374) Worker can be disabled by blocked connectors

Chris Egerton (Jira) Tue, 07 Jan 2020 11:06:19 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009998#comment-17009998
 ]


Chris Egerton edited comment on KAFKA-9374 at 1/7/20 7:05 PM:
--------------------------------------------------------------

[~tombentley] yeah, it's not great if a lot of threads are created and then 
abandoned. However, I'd like to make the following observations:
 * We already use this approach with connector tasks and it seems to be working 
well enough; the only time in the wild I've seen it be a problem is due to an 
issue in the Connect framework, specifically KAFKA-9051
 * A single poorly-behaved connector should not be able to block the worker's 
REST API; users could have dozens if not hundreds of connectors running on 
their worker
 * If a connector config has already been written to the config topic but the 
connector blocks in its {{start}} method, removing it can be extremely 
difficult if the worker that's blocked by that connector is the leader or if 
the cluster only consists of a single worker; it may even be necessary to 
directly write to the internal config topic to do so

Because we want the worker to remain available even when connectors hang, I 
think sacrificing a thread in this situation is preferable to sacrificing the 
entire worker. If there's a way to avoid continually creating and then 
abandoning new threads when poorly-behaved connectors are created and then 
block indefinitely _and_ keep the worker going with no impact on the REST API 
or other connectors and tasks, we should definitely consider it. However, as 
far as I know, once a Java thread is blocked there is no guaranteed, safe way 
of unblocking or terminating it (well, besides 
[this|https://stackoverflow.com/a/32909191/12417563]).
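
The trade-off described above (give up one thread rather than the whole worker) can be illustrated with a minimal sketch. This is not code from the Connect framework; the class name {{AbandonBlockedThread}}, the method {{runWithTimeout}}, and the thread name are hypothetical, and only standard {{java.util.concurrent}} calls are used. It also shows why interruption is only a best effort: a call that swallows {{InterruptedException}} keeps running after {{cancel(true)}}.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not part of the Kafka Connect code base.
public class AbandonBlockedThread {

    /**
     * Runs the action on its own daemon thread and waits up to timeoutMs.
     * Returns true if the action finished in time, false if it was abandoned.
     */
    public static boolean runWithTimeout(Runnable action, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "connector-call");
            t.setDaemon(true); // an abandoned thread must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(action);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the thread, but code that
            // ignores interrupts keeps blocking and the thread is leaked.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } catch (InterruptedException e) {
            // The caller itself was interrupted; restore the flag and give up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } finally {
            executor.shutdown(); // does not wait for the blocked thread
        }
    }

    public static void main(String[] args) {
        Runnable wellBehaved = () -> { };
        Runnable hangs = () -> {
            while (true) {
                try {
                    Thread.sleep(1_000);
                } catch (InterruptedException ignored) {
                    // swallow the interrupt and keep blocking
                }
            }
        };
        System.out.println(runWithTimeout(wellBehaved, 500)); // prints true
        System.out.println(runWithTimeout(hangs, 200));       // prints false
    }
}
```

In the second call the caller gets control back after the timeout, while the hung thread stays alive in the background. That is the cost being accepted here.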

 

As for failing the connector in a noticeable way, I completely agree. I 
mentioned that we could transition connectors that have at least been created 
(i.e., written to the config topic) to a failed state should they block for too 
long, which would be a start toward alerting the user that their connector is 
buggy. Additionally, if a connector blocks in something like its 
{{validate}} or {{config}} method, we could also fail the REST request that led 
to that method invocation.
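
A rough sketch of those two outcomes, under stated assumptions: {{FailOnTimeoutSketch}}, its {{State}} enum, and the {{start}}/{{validate}} helpers are hypothetical names invented for illustration and do not reflect how the framework actually tracks connector status or handles REST requests. A connector that already exists is marked failed when its start call times out; a validation call that times out fails the request by throwing.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical illustration; not part of the Kafka Connect code base.
public class FailOnTimeoutSketch {

    // Hypothetical states for this sketch only.
    public enum State { RUNNING, FAILED }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    // Daemon threads, so abandoned calls cannot keep the JVM alive.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    /** Starts an already-created connector; records FAILED if start blocks too long or throws. */
    public State start(String name, Runnable startMethod, long timeoutMs) {
        State result;
        try {
            CompletableFuture.runAsync(startMethod, executor)
                    .get(timeoutMs, TimeUnit.MILLISECONDS);
            result = State.RUNNING;
        } catch (TimeoutException | ExecutionException e) {
            result = State.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result = State.FAILED;
        }
        states.put(name, result);
        return result;
    }

    /** No connector exists yet during validation, so the request itself fails. */
    public <T> T validate(Supplier<T> validateMethod, long timeoutMs) {
        try {
            return CompletableFuture.supplyAsync(validateMethod, executor)
                    .get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("connector validation timed out", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while validating", e);
        }
    }

    public State stateOf(String name) {
        return states.get(name);
    }
}
```

A REST layer could translate the exception thrown by {{validate}} into an error response, while the recorded {{FAILED}} state would be what a status request reports for a connector whose start call hung.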


was (Author: chrisegerton):
[~tombentley] yeah, it's not great if a lot threads are created and then 
abandoned. However, I'd like to make the following observations:
 * We already use this approach with connector tasks and it seems to be working 
well enough; the only time in the wild I've seen it be a problem is due to an 
issue in the Connect framework, specifically KAFKA-9051
 * A single poorly-behaved connector should not be able to block the worker's 
REST API; users could have dozens if not hundreds of connectors running on 
their worker
 * If a connector config has already been written to the config topic but the 
connector blocks in its {{start}} method, removing it can be extremely 
difficult if the worker that's blocked by that connector is the leader or if 
the cluster only consists of a single worker; it may even be necessary to 
directly write to the internal config topic to do so

Because we want the worker to remain available even when connectors hang, I 
think sacrificing a thread in this situation is preferable to sacrificing the 
entire worker. If there's a way to avoid continually creating threads when 
poorly-behaved connectors are created and then block indefinitely _and_ keep 
the worker going with no impact on the REST API or other connectors and tasks, 
we should definitely consider it. However, as far as I know, once a Java thread 
is blocked there is no guaranteed, safe way of blocking or terminating it 
(well, besides [this|https://stackoverflow.com/a/32909191/12417563]).

 

As far as failing the connector in a noticeable way–I completely agree. I 
mentioned that we could transition connectors that have at least been created 
(i.e., written to the config topic) to a failed state should they block for too 
long, which would be a start as far as alerting the user that their connector 
is buggy goes. Additionally, if a connector blocks in something like its 
{{validate}} or {{config}} method, we could also fail the REST request that led 
to that method invocation.

> Worker can be disabled by blocked connectors
> --------------------------------------------
>
>                 Key: KAFKA-9374
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9374
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0, 
> 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>
> If a connector hangs during any of its {{initialize}}, {{start}}, {{stop}}, 
> {{taskConfigs}}, {{taskClass}}, {{version}}, {{config}}, or {{validate}} 
> methods, the worker will be disabled for some types of requests thereafter, 
> including connector creation, connector reconfiguration, and connector 
> deletion.
>  This only occurs in distributed mode and is due to the threading model used 
> by the 
> [DistributedHerder|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java]
>  class.
>  
> One potential solution could be to treat connectors that fail to start, stop, 
> etc. in time similarly to tasks that fail to stop within the [task graceful 
> shutdown timeout 
> period|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerConfig.java#L121-L126]
>  by handling all connector interactions on a separate thread, waiting for 
> them to complete within a timeout, and abandoning the thread (and 
> transitioning the connector to the {{FAILED}} state, if it has been created 
> at all) if that timeout expires.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
