[
https://issues.apache.org/jira/browse/KAFKA-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009998#comment-17009998
]
Chris Egerton edited comment on KAFKA-9374 at 1/7/20 7:05 PM:
--------------------------------------------------------------
[~tombentley] yeah, it's not great if a lot of threads are created and then
abandoned. However, I'd like to make the following observations:
* We already use this approach with connector tasks and it seems to be working
well enough; the only time in the wild I've seen it be a problem is due to an
issue in the Connect framework, specifically KAFKA-9051
* A single poorly-behaved connector should not be able to block the worker's
REST API; users could have dozens if not hundreds of connectors running on
their worker
* If a connector config has already been written to the config topic but the
connector blocks in its {{start}} method, removing it can be extremely
difficult if the worker that's blocked by that connector is the leader or if
the cluster only consists of a single worker; it may even be necessary to
directly write to the internal config topic to do so
Because we want the worker to remain available even when connectors hang, I
think sacrificing a thread in this situation is preferable to sacrificing the
entire worker. If there's a way to avoid continually creating and then
abandoning new threads when poorly-behaved connectors are created and then
block indefinitely _and_ keep the worker going with no impact on the REST API
or other connectors and tasks, we should definitely consider it. However, as
far as I know, once a Java thread is blocked there is no guaranteed, safe way
of unblocking or terminating it (well, besides
[this|https://stackoverflow.com/a/32909191/12417563]).
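To illustrate the point about blocked threads, here is a minimal sketch in plain Java (not Connect code; the class and method names are hypothetical). It assumes a worst-case "connector" that swallows interrupts, and shows that {{Thread.interrupt()}} is only a request that such a thread is free to ignore:

```java
class InterruptIsCooperative {
    // Starts a thread that ignores interrupts, interrupts it, waits up to
    // waitMs for it to exit, and reports whether it is still alive.
    static boolean stillAliveAfterInterrupt(long waitMs) throws InterruptedException {
        Thread stuck = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(60_000);
                } catch (InterruptedException ignored) {
                    // A poorly-behaved connector might swallow the interrupt like this.
                }
            }
        }, "stuck-connector");
        stuck.setDaemon(true); // so the abandoned thread cannot keep the JVM alive
        stuck.start();

        stuck.interrupt();     // a request, not a guarantee
        stuck.join(waitMs);    // give it a chance to exit
        return stuck.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stillAliveAfterInterrupt(200)); // prints "true"
    }
}
```

The only thing the caller can reliably do here is stop waiting, which is why abandoning the thread comes up as the fallback.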
As for failing the connector in a noticeable way, I completely agree. I
mentioned that we could transition connectors that have at least been created
(i.e., written to the config topic) to a failed state should they block for too
long, which would be a start toward alerting users that their connector is
buggy. Additionally, if a connector blocks in something like its
{{validate}} or {{config}} method, we could also fail the REST request that led
to that method invocation.
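A rough sketch of what "run it on another thread, wait with a timeout, then mark it failed" could look like using only {{java.util.concurrent}}. This is not the actual Connect implementation: {{ConnectorTimeoutSketch}}, {{State}}, and {{startWithTimeout}} are hypothetical names, and the timeout value is an arbitrary assumption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ConnectorTimeoutSketch {
    // Hypothetical, simplified connector states for illustration only.
    enum State { RUNNING, FAILED }

    // Daemon threads, so an abandoned thread cannot block JVM shutdown.
    private static final ExecutorService EXECUTOR = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "connector-call");
        t.setDaemon(true);
        return t;
    });

    // Runs a connector method off the caller's thread and waits at most
    // timeoutMs for it to finish, so the caller itself never hangs.
    static State startWithTimeout(Runnable connectorStart, long timeoutMs) {
        Future<?> future = EXECUTOR.submit(connectorStart);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return State.RUNNING;
        } catch (TimeoutException e) {
            // Best-effort interrupt; if it is ignored, the thread is abandoned.
            future.cancel(true);
            return State.FAILED;
        } catch (ExecutionException e) {
            // The connector method threw instead of hanging.
            return State.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.FAILED;
        }
    }

    public static void main(String[] args) {
        State ok = startWithTimeout(() -> { }, 1_000);
        State hung = startWithTimeout(() -> {
            try {
                Thread.sleep(60_000); // stands in for a start() that blocks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 100);
        System.out.println(ok + " " + hung); // prints "RUNNING FAILED"
    }
}
```

The same wrapper shape would apply to a call like {{validate}} or {{config}}, where the timeout branch would fail the REST request rather than change connector state.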
> Worker can be disabled by blocked connectors
> --------------------------------------------
>
> Key: KAFKA-9374
> URL: https://issues.apache.org/jira/browse/KAFKA-9374
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0,
> 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1
> Reporter: Chris Egerton
> Assignee: Chris Egerton
> Priority: Major
>
> If a connector hangs during any of its {{initialize}}, {{start}}, {{stop}},
> {{taskConfigs}}, {{taskClass}}, {{version}}, {{config}}, or {{validate}}
> methods, the worker will be disabled for some types of requests thereafter,
> including connector creation, connector reconfiguration, and connector
> deletion.
> This only occurs in distributed mode and is due to the threading model used
> by the
> [DistributedHerder|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java]
> class.
>
> One potential solution could be to treat connectors that fail to start, stop,
> etc. in time similarly to tasks that fail to stop within the [task graceful
> shutdown timeout
> period|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerConfig.java#L121-L126]
> by handling all connector interactions on a separate thread, waiting for
> them to complete within a timeout, and abandoning the thread (and
> transitioning the connector to the {{FAILED}} state, if it has been created
> at all) if that timeout expires.