C0urante commented on PR #13530: URL: https://github.com/apache/kafka/pull/13530#issuecomment-1613554645
Hmmm... is there an advantage to enabling users to pause/resume a failed connector? I feel like restarting it is still the natural corrective action in that case, which will also cause task config generation to be retried. Honestly, I think the biggest obstacle to preventing pause/resume for connectors that have failed to generate task configs is that it'll look uglier in the code base, especially if we take a naive approach and introduce some kind of public `WorkerConnector::setState` method (gross). Perhaps we could add a `WorkerConnector::taskConfigs` method that accepts a `boolean reportFailure` parameter (feel free to rename), which controls whether any exceptions encountered while invoking `Connector::taskConfigs` (or anywhere else in the method, really) should be reported to the connector's status listener and result in a state change? It's still a little unclean, but it does reflect the needs of our internal API: in one mode, we want these errors to be thrown to the caller without affecting any other state for the connector, and in another mode, we still want these errors to be thrown (so as to surface in REST responses as 500 errors), but also to result in a change in state for the connector. Also, one more advantage of keeping the state tracking (and reporting) inside the `WorkerConnector` class is that we currently use a [delegating status listener](https://github.com/apache/kafka/blob/30b087ead967b28d459945fe90c80545bf189d1f/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerConnector.java#L98-L99) to record metrics whenever the connector's status is updated. I believe with the current implementation in this PR, there would be a bug in metrics reporting for connectors (LMK if I'm mistaken, though). Finally, regarding augmenting error messages with a note that tasks may still be running--as long as we don't have to wrap any connector-thrown exceptions in order to introduce this additional wording, I'm in favor of it. I just don't want to add another set of stack frames and another link in the "caused by" chain without good reason since it makes things harder to read (I've seen people copy+paste stack traces that end with ["Tolerance exceeded in error handler"](https://github.com/apache/kafka/blob/30b087ead967b28d459945fe90c80545bf189d1f/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L591) and omit the actual root cause, which is... less than helpful). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org