C0urante commented on PR #12485:
URL: https://github.com/apache/kafka/pull/12485#issuecomment-1218216005

   > `awaitTermination` can throw an `InterruptedException`, and once that's thrown, the KafkaBasedLog remains in an infinite loop, since I believe the initial readToEnd is run off the same thread (switched to a `WorkerThread` later on). That's the interrupted exception that I was looking to handle in this PR. Do you think that makes sense?
   
   Yes, but my point is that we don't actually have any guarantees that `awaitTermination` or `shutdownNow` will throw an `InterruptedException`, or cause one to be thrown on the threads running any of the executor's currently-executing tasks. So, even though a JVM that's running Connect _may_ cause an `InterruptedException` to be encountered in `KafkaBasedLog::poll`, we should not rely on that behavior, since a different, fully-compliant JVM might not.
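   To illustrate with plain JDK classes (a standalone sketch, not Connect code): a task that never checks its interrupt status and never calls a blocking method can outlive even `shutdownNow`, whose Javadocs only promise best-effort attempts to stop actively executing tasks:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownNowIsBestEffort {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // Stand-in for a loop like KafkaBasedLog::poll that neither checks the
        // thread's interrupt status nor calls anything that throws
        // InterruptedException; interruption alone cannot stop it.
        Runnable uninterruptibleLoop = () -> {
            long spins = 0;
            while (true) {
                spins++;
            }
        };
        executor.submit(uninterruptibleLoop);

        Thread.sleep(100);                 // let the task start
        executor.shutdownNow();            // best-effort Thread.interrupt() only
        boolean terminated = executor.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println("terminated: " + terminated); // prints "terminated: false"
    }
}
```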
   
   It's also worth pointing out that `awaitTermination` will almost certainly _not_ cause an `InterruptedException` to be triggered on the executor's threads, since the [Javadocs for that method](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#awaitTermination-long-java.util.concurrent.TimeUnit-) state that it will block "until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first", and the [Javadocs for `shutdown`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#shutdown--) state that it "Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted." This is why I was discussing the behavior during shutdown after the graceful timeout had already expired; the changes here are unlikely to have any effect until that condition is reached.
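   A similarly minimal JDK-only sketch of that behavior: after `shutdown`, an in-flight task keeps running untouched, and `awaitTermination` just waits for it (or for the timeout) without interrupting anything:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownDoesNotInterrupt {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> {
            try {
                // Stand-in for a long-running read-to-end: a blocking call that
                // *would* throw InterruptedException if the thread were interrupted.
                Thread.sleep(TimeUnit.MINUTES.toMillis(5));
                System.out.println("task finished normally");
            } catch (InterruptedException e) {
                System.out.println("task was interrupted"); // never printed here
            }
        });

        Thread.sleep(100);           // let the task start
        executor.shutdown();         // orderly: the in-flight task keeps running
        boolean done = executor.awaitTermination(1, TimeUnit.SECONDS);
        // Prints "terminated within timeout: false"; the task is never interrupted.
        System.out.println("terminated within timeout: " + done);
    }
}
```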
   
   > IIUC, the scenario you have described is: even after the graceful shutdown, if the Log isn't stopped and the herder never gets interrupted, the potential infinite loop issue still remains. Is that correct?
   
   Sort of? I think there are two points worth focusing on here. First, we cannot
and should not rely on thread interruption taking place as a result of invoking 
`shutdown`, `awaitTermination`, or `shutdownNow` on an executor. Second, with 
an ideal approach, we wouldn't have to wait for the graceful shutdown timeout 
to elapse before causing any active reads-to-end of the worker's internal 
topics to halt and return early (either normally or by throwing an exception).
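   Purely as a sketch of what I mean by "return early" (a hypothetical helper, not the actual `KafkaBasedLog` code): shutdown can flip a flag and call `Consumer::wakeup`, which aborts a blocked `poll` with a `WakeupException`, instead of depending on thread interruption:

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.errors.WakeupException;

// Hypothetical helper (not KafkaBasedLog itself): illustrates halting an
// in-progress read early without relying on Thread.interrupt().
public class EarlyStoppableReader<K, V> {
    private final Consumer<K, V> consumer;
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    public EarlyStoppableReader(Consumer<K, V> consumer) {
        this.consumer = consumer;
    }

    // Runs on the reading thread, analogous to a read-to-end loop.
    public void readUntilStopped() {
        try {
            while (!stopped.get()) {
                consumer.poll(Duration.ofMinutes(5));
                // ... process records, track end offsets, etc. ...
            }
        } catch (WakeupException e) {
            // Expected when stop() is called while poll() is blocked; return early.
        }
    }

    // Safe to call from any thread, e.g. during herder shutdown.
    public void stop() {
        stopped.set(true);
        consumer.wakeup(); // aborts a blocked poll() with a WakeupException
    }
}
```

   If `KafkaBasedLog::stop` already behaves roughly like `stop()` above, then wiring it into herder shutdown would give us that "return early" behavior without relying on interruption at all.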
   
   > I think using `KafkaBasedLog::stop` might be a good idea in this case. BTW, I see there's something called `stopServices` which effectively stops the backing consumers. That inherently calls the `KafkaBasedLog::stop` method, and it's all wired up via the `halt` method, which checks a `stopping` flag that is set in the `stop` method (BTW, I know you know all this, just writing it down for my own confirmation :D ). Do you think the situation you described would still arise, or was the only unhandled case the interruption of the herder executor? WDYT?
   
   I think this approach has some promise (possibly but not necessarily 
leveraging `stopServices` to halt an in-progress startup), but haven't been 
able to think through all the details yet. Some things to consider:
   - Does standalone mode need to be taken into account as well?
   - Will any logic that we add in `DistributedHerder::stop` conflict with the logic contained in `halt`, which will presumably be called once the herder finishes or aborts startup? (There's a rough sketch of one way to keep this idempotent after this list.)
   - Will anything we add to `DistributedHerder::stop` run the risk of blocking 
indefinitely? If so, what are the potential ramifications?
   - None of this should result in any error messages being logged unless an 
operation has actually failed (instead of just being terminated prematurely)
   - Can we ensure that the worker is able to gracefully leave the cluster if 
it's been able to join?
   - Can we ensure that the worker doesn't try to leave the cluster (or at the 
very least, cause any issues by trying to leave the cluster) if it hasn't been 
able to join?
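   On the second bullet in particular, here's the kind of idempotency guard I have in mind (names here are hypothetical, not actual herder code): whichever of `stop`/`halt` tears the services down first wins, and the other becomes a no-op, so there's no double-stop and no spurious error logging:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard, not the actual herder code: ensures the teardown work
// (e.g. stopServices) runs at most once even if stop() and halt() both reach it.
public class ShutdownGuard {
    private final AtomicBoolean servicesStopped = new AtomicBoolean(false);

    public void stopServicesOnce(Runnable stopServices) {
        if (servicesStopped.compareAndSet(false, true)) {
            stopServices.run();
        }
    }
}
```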

