jiajunwang opened a new issue #1512: URL: https://github.com/apache/helix/issues/1512
### Describe the bug

This issue is believed to be triggered by the commit https://github.com/apache/helix/commit/f11396e5feebe20552d259e553342c17a8573a8e

The theory is that the logic in that change follows the design and is correct on its own, but it exposes a latent problem:

1. We agreed that if scheduling a state transition task fails, the partition should be put in the ERROR state.
2. When we close a participant, we shut down the executor first. Because the callback handler is still alive, there is a race condition in which the handler tries to execute a state transition. Since the thread pool has already been shut down, scheduling fails and the partition ends up in the ERROR state.
3. In most cases this is fine, because the participant is shut down immediately after the thread pool. With no thread pool running, there is no side effect on the real application logic.
4. Unfortunately, the Task Framework (TF) may observe the ERROR state during the race. I find that it then stops processing the job because of this ERROR task, even though the participant has been shut down and there is no live instance.

I noticed this because TestTaskRebalancerFailover became unstable due to this race condition.

Two potential ways to fix it:

1. Change the TF logic so that it ignores an ERROR partition on an offline node.
2. Or fix the participant shutdown process.

### To Reproduce

Run TestTaskRebalancerFailover several times. It usually gets stuck on the 2nd or 3rd try.

### Expected behavior

The job should finish.
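The race in step 2 can be illustrated with a plain `java.util.concurrent` executor. This is only a minimal sketch of the failure mode, not Helix code: the class `ShutdownRaceSketch`, the method `scheduleTransition`, and the string states are hypothetical names chosen for illustration. It assumes the executor uses the default rejection policy, which throws `RejectedExecutionException` for tasks submitted after shutdown.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical sketch, not Helix code: models a callback handler that
// outlives the executor it schedules state transitions on.
public class ShutdownRaceSketch {

    // Tries to schedule a state transition and returns the resulting
    // (illustrative) partition state.
    static String scheduleTransition(ExecutorService executor) {
        try {
            executor.submit(() -> { /* state transition work would run here */ });
            return "SCHEDULED";
        } catch (RejectedExecutionException e) {
            // Scheduling failed, so the partition is marked ERROR (step 1).
            return "ERROR";
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // Normal operation: the executor accepts the task.
        System.out.println(scheduleTransition(executor)); // prints SCHEDULED

        // Participant shutdown begins: the executor is stopped first...
        executor.shutdown();

        // ...but the still-alive handler tries to schedule one more transition.
        System.out.println(scheduleTransition(executor)); // prints ERROR
    }
}
```

In this sketch, the second call fails only because the submitter is still active after the executor has been shut down. Stopping the submitter before the executor would close that window, which is the idea behind the second fix proposed above.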
