vamossagar12 commented on PR #12485: URL: https://github.com/apache/kafka/pull/12485#issuecomment-1221262360
Thanks @C0urante . I think those are great points and I now realise the mistake I made with the PR. Here's what I am thinking and the answers to the rest of your points => 1) I agree for the most part that `InterruptedException` may or may not be thrown and it's pretty much unreliable. Having said that, if it does occur i.e the thread in `herderExecutor` does get interrupted, the backing offset kafka consumers would definitely get into an infinite loop. IMO, that's something that needs to be handled. That's where the original approach in the PR is wrong as it's catching the `InterruptedException` in `consumer.poll` which may never even see it. Instead, I am thinking to add it in the `DistributedHerder.stop` method. I see it catches `InterruptedException` but doesn't do anything about it. What I think that can be done is that once the `herderExecutor` thread is interrupted, we can look to force close the Kafka consumers. This can be done by invoking `stopServices`. At this point, some of the consumers might be closed and some of them might not be, and here we can leverage the `stopRequested` flag i.e if this flag is set, then for all practical purpo ses the log should get closed. So, we can add that check in `KafkaConsumer.stop`. Let me know if this is making sense. - Does standalone mode need to be taken into account as well? Yeah I think it can have a similar issue so probably that can also be accounted for. - Will any logic that we add in DistributedHerder::stop conflict with the logic contained in halt, which will presumably be called once the herder finishes/aborts startup? That's a great point.. It could happen that `stopServices` might be invoked from both halt and from `InterruptedException`. So, what we would need to ensure is, that calling the `stop` method multiple times should be safe. The check that I described using `stopRequested` might be helpful. WDYT? - Will anything we add to DistributedHerder::stop run the risk of blocking indefinitely? If so, what are the potential ramifications? The approach I described, maybe not. Would love to hear your thoughts on this though. - None of this should result in any error messages being logged unless an operation has actually failed (instead of just being terminated prematurely) +1 Regarding the last 3 points, I am not sure at this moment. Maybe if this approach sounds ok, we can get these answered. - Can we ensure that the worker is able to gracefully leave the cluster if it's been able to join? - Can we ensure that the worker doesn't try to leave the cluster (or at the very least, cause any issues by trying to leave the cluster) if it hasn't been able to join? - If we're able to successfully abort one startup operation (such as reading to the end of the offsets topic), can we make sure that we don't even attempt any following startup operations (such as reading to the end of the status topic, which currently takes place after reading to the end of the offsets topic) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
