vamossagar12 commented on PR #12485:
URL: https://github.com/apache/kafka/pull/12485#issuecomment-1221262360

   Thanks @C0urante . I think those are great points and I now realise the 
mistake I made with the PR. Here's what I am thinking and the answers to the 
rest of your points =>
   
   1) I agree for the most part that `InterruptedException` may or may not be 
thrown and it's pretty much unreliable. Having said that, if it does occur i.e 
the thread in `herderExecutor` does get interrupted, the backing offset kafka 
consumers would definitely get into an infinite loop. IMO, that's something 
that needs to be handled. That's where the original approach in the PR  is 
wrong as it's catching the `InterruptedException` in `consumer.poll` which may 
never even see it. Instead, I am thinking to add it in the 
`DistributedHerder.stop` method. I see it catches `InterruptedException` but 
doesn't do anything about it. What I think that can be done is that once the 
`herderExecutor` thread is interrupted, we can look to force close the Kafka 
consumers. This can be done by invoking `stopServices`. At this point, some of 
the consumers might be closed and some of them might not be, and here we can 
leverage the `stopRequested` flag i.e if this flag is set, then for all 
practical purpo
 ses the log should get closed. So, we can add that check in 
`KafkaConsumer.stop`. Let me know if this is making sense.
   
   - Does standalone mode need to be taken into account as well?
   
   Yeah I think it can have a similar issue so probably that can also be 
accounted for.
   
   - Will any logic that we add in DistributedHerder::stop conflict with the 
logic contained in halt, which will presumably be called once the herder 
finishes/aborts startup?
   
   That's a great point.. It could happen that `stopServices` might be invoked 
from both halt and from `InterruptedException`. So, what we would need to 
ensure is, that calling the `stop` method multiple times should be safe. The 
check that I described using `stopRequested` might be helpful. WDYT?
   
   - Will anything we add to DistributedHerder::stop run the risk of blocking 
indefinitely? If so, what are the potential ramifications?
   
   The approach I described, maybe not. Would love to hear your thoughts on 
this though.
   
   - None of this should result in any error messages being logged unless an 
operation has actually failed (instead of just being terminated prematurely)
   
   +1
   
   Regarding the last 3 points, I am not sure at this moment. Maybe if this 
approach sounds ok, we can get these answered. 
   - Can we ensure that the worker is able to gracefully leave the cluster if 
it's been able to join?
   - Can we ensure that the worker doesn't try to leave the cluster (or at the 
very least, cause any issues by trying to leave the cluster) if it hasn't been 
able to join?
   - If we're able to successfully abort one startup operation (such as reading 
to the end of the offsets topic), can we make sure that we don't even attempt 
any following startup operations (such as reading to the end of the status 
topic, which currently takes place after reading to the end of the offsets 
topic)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to