[GitHub] [helix] jiajunwang opened a new issue #1512: A possible race condition causes ERROR task and block the job.

GitBox Fri, 06 Nov 2020 11:07:46 -0800


jiajunwang opened a new issue #1512:
URL: https://github.com/apache/helix/issues/1512



   ### Describe the bug
   This issue is believed to be triggered by PR 
https://github.com/apache/helix/commit/f11396e5feebe20552d259e553342c17a8573a8e
   The theory is that the PR logic follows the design and is correct, but it exposes a latent problem.
   
   1. We agreed that if scheduling a state transition task fails, the partition should be put in the ERROR state.
   2. When we close a participant, we shut down the executor first. Because the callback handler is still alive, there is a race condition in which the handler tries to execute a state transition. Since the thread pool has already been shut down, scheduling fails and the partition ends up in the ERROR state.
   3. In most cases this is fine, because the participant shuts down immediately after the thread pool does. With no thread pool, there is no side effect on the real application logic.
   4. Unfortunately, the Task Framework (TF) may observe the ERROR state during the race. I found that it then stops processing the job because of this ERROR task, even though the participant has been shut down and there is no live instance.
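
The failure mode in step 2 can be sketched with plain `java.util.concurrent` classes. This is a minimal illustration, not Helix code: `ShutdownRaceSketch` and `scheduleTransition` are hypothetical names, and the returned strings only stand in for the partition state. It shows that submitting work to an executor that has already been shut down is rejected, which is the scheduling failure that maps to ERROR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class ShutdownRaceSketch {
    // Hypothetical stand-in for the participant's transition scheduling; not a Helix API.
    static String scheduleTransition(ExecutorService executor) {
        try {
            executor.submit(() -> { /* the state transition would run here */ });
            return "SCHEDULED";
        } catch (RejectedExecutionException e) {
            // Scheduling failed, so by the agreed rule the partition is marked ERROR.
            return "ERROR";
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        System.out.println(scheduleTransition(executor)); // prints "SCHEDULED"

        // The executor is shut down while the callback handler is still alive.
        executor.shutdown();

        // A late callback now fails to schedule its transition.
        System.out.println(scheduleTransition(executor)); // prints "ERROR"
    }
}
```

In the real race the late callback arrives from another thread, so whether the ERROR appears depends on timing; the sketch forces the bad ordering to make it deterministic.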
   
   I noticed this because TestTaskRebalancerFailover became unstable due to this race condition.
   There are two potential fixes:
   1. Change the TF logic so that it ignores an ERROR partition on an offline node.
   2. Or fix the participant shutdown process.
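
Both options can be sketched roughly as follows. Everything here is hypothetical and only illustrates the idea: `errorBlocksJob`, `Participant`, and `onStateTransitionMessage` are made-up names, not Helix classes or methods, and the actual fix may look quite different.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ShutdownFixSketch {
    // Option 1 (hypothetical helper): an ERROR partition only blocks the job
    // if the instance that reported it is still live.
    static boolean errorBlocksJob(String instance, Set<String> liveInstances) {
        return liveInstances.contains(instance);
    }

    // Option 2 (hypothetical participant): stop accepting callbacks before the
    // executor is shut down, so nothing is scheduled on a dead thread pool.
    static class Participant {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();
        private boolean accepting = true;

        // Synchronized with close() so the check and the submit are atomic.
        synchronized String onStateTransitionMessage() {
            if (!accepting) {
                return "IGNORED"; // shutdown has begun; do not mark ERROR
            }
            executor.submit(() -> { /* the state transition would run here */ });
            return "SCHEDULED";
        }

        synchronized void close() {
            accepting = false;   // 1. disable the callback path first
            executor.shutdown(); // 2. then shut down the thread pool
        }
    }

    public static void main(String[] args) {
        // Option 1: the node is offline, so its ERROR partition is ignored.
        System.out.println(errorBlocksJob("node1", Collections.emptySet())); // prints "false"

        // Option 2: a callback that arrives after close() is dropped.
        Participant p = new Participant();
        System.out.println(p.onStateTransitionMessage()); // prints "SCHEDULED"
        p.close();
        System.out.println(p.onStateTransitionMessage()); // prints "IGNORED"
    }
}
```

Option 1 tolerates the stale ERROR on the controller side, while option 2 prevents it from being produced at all; the two are not mutually exclusive.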
   
   ### To Reproduce
   Run TestTaskRebalancerFailover several times; it usually gets stuck on the 2nd or 3rd run.
   
   ### Expected behavior
   The job should finish.
   
   


