[ https://issues.apache.org/jira/browse/KAFKA-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009998#comment-17009998 ]
Chris Egerton commented on KAFKA-9374:
--------------------------------------

[~tombentley] yeah, it's not great if a lot of threads are created and then abandoned. However, I'd like to make the following observations:

* We already use this approach with connector tasks and it seems to be working well enough; the only time I've seen it be a problem in the wild was due to an issue in the Connect framework, specifically KAFKA-9051
* A single poorly-behaved connector should not be able to block the worker's REST API; users could have dozens if not hundreds of connectors running on their worker
* If a connector config has already been written to the config topic but the connector blocks in its {{start}} method, removing it can be extremely difficult if the worker that's blocked by that connector is the leader or if the cluster only consists of a single worker; it may even be necessary to write directly to the internal config topic to do so

Because we want the worker to remain available even when connectors hang, I think sacrificing a thread in this situation is preferable to sacrificing the entire worker. If there's a way to avoid continually creating threads when poorly-behaved connectors are created and then block indefinitely _and_ keep the worker going with no impact on the REST API or other connectors and tasks, we should definitely consider it. However, as far as I know, once a Java thread is blocked there is no guaranteed, safe way of unblocking or terminating it (well, besides [this|https://stackoverflow.com/a/32909191/12417563]).

As far as failing the connector in a noticeable way goes, I completely agree. I mentioned that we could transition connectors that have at least been created (i.e., written to the config topic) to a failed state should they block for too long, which would be a start as far as alerting the user that their connector is buggy goes.
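
To make the timeout idea concrete, here is a rough sketch of running a connector method on its own thread and giving up on it after a deadline. {{ConnectorCallGuard}}, {{Outcome}}, and {{call}} are made-up names for illustration and are not part of the Connect runtime; the cached pool of daemon threads is likewise only an assumption about how abandoned threads might be managed.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not an existing Connect class.
final class ConnectorCallGuard {
    enum Outcome { COMPLETED, FAILED, TIMED_OUT }

    // Cached pool: a call that never returns ties up its own thread, and later
    // calls simply get a fresh one. Daemon threads so that an abandoned thread
    // cannot keep the JVM alive on shutdown.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "connector-call");
        t.setDaemon(true);
        return t;
    });

    // Runs the given connector interaction off the caller's thread and waits
    // at most timeoutMs for it to finish.
    Outcome call(Runnable connectorMethod, long timeoutMs) {
        Future<?> future = executor.submit(connectorMethod);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Outcome.COMPLETED;
        } catch (TimeoutException e) {
            // Best effort: this only interrupts the thread. Code that ignores
            // interruption keeps running, and the thread is abandoned.
            future.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            // The connector method itself threw.
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            // The waiting (caller) thread was interrupted; restore the flag.
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }
}
```

The caller gets control back after at most the timeout either way, so on {{TIMED_OUT}} it could mark the connector as failed and carry on serving other requests. The cost is one potentially leaked thread per hung call.
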
Additionally, if a connector blocks in something like its {{validate}} or {{config}} method, we could also fail the REST request that led to that method invocation.

> Worker can be disabled by blocked connectors
> --------------------------------------------
>
>                 Key: KAFKA-9374
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9374
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0,
>                      2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>
> If a connector hangs during any of its {{initialize}}, {{start}}, {{stop}},
> {{taskConfigs}}, {{taskClass}}, {{version}}, {{config}}, or {{validate}}
> methods, the worker will be disabled for some types of requests thereafter,
> including connector creation, connector reconfiguration, and connector
> deletion.
> This only occurs in distributed mode and is due to the threading model used
> by the
> [DistributedHerder|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java]
> class.
>
> One potential solution could be to treat connectors that fail to start, stop,
> etc. in time similarly to tasks that fail to stop within the
> [task graceful shutdown timeout period|https://github.com/apache/kafka/blob/03f763df8a8d9482d8c099806336f00cf2521465/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerConfig.java#L121-L126]
> by handling all connector interactions on a separate thread, waiting for
> them to complete within a timeout, and abandoning the thread (and
> transitioning the connector to the {{FAILED}} state, if it has been created
> at all) if that timeout expires.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)