jihoonson opened a new issue #7828: Supervisor marks succeeded replicas as 
failed too aggressively
URL: https://github.com/apache/incubator-druid/issues/7828
 
 
   ### Affected Version
   
   All versions since 0.9.1.
   
   ### Description
   
   The `seekableSupervisor` does the following when a replica succeeds.
   
   - Check the status of all other replicas from `taskStorage`.
   - Stop all other replicas that have not finished yet.
     - For tasks of unknown status, the supervisor kills them.
     - If the stop request fails for some tasks, the supervisor kills them.
   
   However, this algorithm has a race because task status is not updated in 
real time. Instead, the supervisor updates it per `runNotice`. As a result, the 
supervisor can successfully kill tasks that have already finished if their 
status has not been updated yet. Those tasks are then marked as failed even 
though their task logs show that they succeeded, which is very confusing.
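
The race can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `StaleStatusRaceSketch`, its `Status` enum, and all of its methods are hypothetical names invented for illustration and are not Druid classes or APIs. The `supervisorView` map stands in for the status the supervisor last observed, and `refreshView` stands in for the per-`runNotice` update.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the race; none of these names are Druid APIs.
class StaleStatusRaceSketch {
    enum Status { RUNNING, SUCCESS, FAILED }

    // What each task has actually done (the ground truth, as in its task log).
    final Map<String, Status> actual = new HashMap<>();
    // The status the supervisor last observed; also what ends up recorded.
    final Map<String, Status> supervisorView = new HashMap<>();

    void start(String taskId) {
        actual.put(taskId, Status.RUNNING);
        supervisorView.put(taskId, Status.RUNNING);
    }

    // A task finishes on its own. The supervisor's view is NOT updated here.
    void finish(String taskId) {
        actual.put(taskId, Status.SUCCESS);
    }

    // Periodic refresh, standing in for the per-runNotice status update.
    void refreshView() {
        supervisorView.putAll(actual);
    }

    // Called when one replica is known to have succeeded: every sibling the
    // supervisor still believes is running is killed and recorded as FAILED.
    void onReplicaSucceeded(String succeededId) {
        supervisorView.put(succeededId, Status.SUCCESS);
        for (Map.Entry<String, Status> e : supervisorView.entrySet()) {
            boolean sibling = !e.getKey().equals(succeededId);
            if (sibling && e.getValue() == Status.RUNNING) {
                // The view may be stale: the task may already have succeeded.
                e.setValue(Status.FAILED);
            }
        }
    }
}
```

In this model, if two replicas both call `finish` and `onReplicaSucceeded` runs for the first one before `refreshView`, the second replica is recorded as `FAILED` while `actual` still says `SUCCESS`. Calling `refreshView` first avoids the mismatch in the model, but in a real system it would only narrow the window.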
   
   One way to work around this problem is to check task status more eagerly. 
However, this would only make the issue happen less often. I think we 
eventually need the following changes.
   
   - Update task status immediately when a status change is reported to the 
overlord.
   - Add a new task status for canceled tasks.
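
The two changes above could look roughly like the following. This is a minimal sketch, not a proposal for the actual implementation: `EagerStatusSketch`, the `CANCELED` value, `onStatusReported`, and `cancel` are all hypothetical names, and the sketch assumes that a terminal status is never overwritten once it has been recorded.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed changes; these names are not Druid APIs.
class EagerStatusSketch {
    // CANCELED is the assumed new status for tasks stopped by the supervisor.
    enum Status { RUNNING, SUCCESS, FAILED, CANCELED }

    private final Map<String, Status> statuses = new ConcurrentHashMap<>();

    void start(String taskId) {
        statuses.put(taskId, Status.RUNNING);
    }

    // Change 1: apply a status change as soon as it is reported.
    // Only RUNNING tasks can transition, so terminal statuses are final.
    void onStatusReported(String taskId, Status reported) {
        statuses.replace(taskId, Status.RUNNING, reported);
    }

    // Change 2: canceling records CANCELED rather than FAILED.
    // Returns false if the task was not running, e.g. it already succeeded.
    boolean cancel(String taskId) {
        return statuses.replace(taskId, Status.RUNNING, Status.CANCELED);
    }

    Status status(String taskId) {
        return statuses.get(taskId);
    }
}
```

With this shape, a replica whose success has already been reported keeps `SUCCESS` even if the supervisor later tries to cancel it, and a replica that really was still running shows up as `CANCELED` instead of `FAILED`. The compare-and-set style `replace` is one possible way to make the transition atomic; other designs would work as well.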
   
   I'm seeing this problem very frequently in our cluster, so I'm marking it 
as a release blocker for 0.15.0.
