C0urante commented on PR #12485: URL: https://github.com/apache/kafka/pull/12485#issuecomment-1218216005
> `awaitTermination` can throw an `InterruptedException` and once that's thrown, the KafkaBasedLog remains in an infinite loop since I believe the initial readToEnd is off the same thread(switched to a `WorkerThread` later on. That's the interrupted exception that I was looking to handle in this PR. Do you think that makes sense? Yes, but my point is that we don't actually have any guarantees that `awaitTermination` or `shutdownNow` will throw an `InterruptedException`, or cause one to be thrown on the threads for any currently-running tasks for the executor. So, even though a JVM that's running Connect _may_ cause an `InterruptedException` to be encountered in `KafkaBasedLog::poll`, we should not rely on that behavior since a different fully-compliant JVM might not. It's also worth pointing out that `awaitTermination` will almost certainly _not_ cause an `InterruptedException` to be triggered on the executor's threads, since the [Javadocs for that method](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#awaitTermination-long-java.util.concurrent.TimeUnit-) state that it will block "until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first.", and the [Javadocs for `shutdown`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#shutdown--) state that it "Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted.". And this is why I was discussing the behavior during shutdown after the graceful timeout had already expired; the changes here are unlikely to have any effect until that condition is reached. > IIUC, the scenario you have described is if even after the graceful shutdown, if the Log isn't stopped and the herder never got to get interrupted, the potential infinite loop issue still remains. Is that correct? Sort of? I think there's two points worth focusing on here. First, we cannot and should not rely on thread interruption taking place as a result of invoking `shutdown`, `awaitTermination`, or `shutdownNow` on an executor. Second, with an ideal approach, we wouldn't have to wait for the graceful shutdown timeout to elapse before causing any active reads-to-end of the worker's internal topics to halt and return early (either normally or by throwing an exception). > I think using `KafkaBasedLog::stop` might be a good idea in this case. BTW, I see there's something called `stopServices` which is effectively stopping these the backing consumers. That inherently calls the `KafkaBasedLog::stop` method and it's all wired using `halt` method which checks upon a `stopping` flag which is set in `stop` method (BTW, I know you know all this, just writing down for my own confirmation :D ) . Do you think the situation you described would still arise or the only unhandled case was the interruption of the herder executor. WDYT? I think this approach has some promise (possibly but not necessarily leveraging `stopServices` to halt an in-progress startup), but haven't been able to think through all the details yet. Some things to consider: - Does standalone mode need to be taken into account as well? - Will any logic that we add in `DistributedHerder::stop` conflict with the logic contained in `halt`, which will presumably be called once the herder finishes/aborts startup? - Will anything we add to `DistributedHerder::stop` run the risk of blocking indefinitely? If so, what are the potential ramifications? - None of this should result in any error messages being logged unless an operation has actually failed (instead of just being terminated prematurely) - Can we ensure that the worker is able to gracefully leave the cluster if it's been able to join? - Can we ensure that the worker doesn't try to leave the cluster (or at the very least, cause any issues by trying to leave the cluster) if it hasn't been able to join? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
