thinkharderdev commented on PR #184:
URL: https://github.com/apache/arrow-ballista/pull/184#issuecomment-1236811715

   > > Had a chance to review more carefully and I'm not sure I understand how 
this will interact with task statuses arriving from the executors. I haven't 
traced all the way through the logic, but from what I can tell, if a stage is 
rolled back due to a lost executor and a task status comes in after the 
rollback (or during the rollback, queued waiting for the write lock on the 
job), it will error out since the stage is no longer running. That will cause 
every task update in that batch to fail. This is not an error we currently 
handle, so it can cause cascading issues, including stalling other jobs out 
entirely since their task updates are effectively dropped.
   > 
   > Good catch. There is a gap here; I will take a look and fix it. In this 
PR, executor-lost handling and task status updates do not interleave with each 
other, so there should be no lock contention. All updates to the `TaskManager` 
can only be triggered through `QueryStageSchedulerEvent`s. The events are 
queued and processed one by one: either `QueryStageSchedulerEvent::ExecutorLost` 
happens first or `QueryStageSchedulerEvent::TaskUpdating` happens first. If the 
executor-lost event happens first, the `QueryStageSchedulerEvent::TaskUpdating` 
events from that executor should be ignored. I already have some checks in the 
task_status RPC call, but there is a gap in 
`TaskManager::update_task_status`; I will take a look.
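
   As a sketch, the serialized handling described above (events drained one at a time, with `TaskUpdating` events from an already-lost executor dropped) could look like the following. The types are hypothetical, heavily simplified stand-ins for Ballista's real `QueryStageSchedulerEvent`:

   ```rust
   use std::collections::HashSet;

   // Hypothetical, simplified scheduler events; the real
   // QueryStageSchedulerEvent variants carry much more state.
   #[derive(Debug)]
   enum SchedulerEvent {
       ExecutorLost { executor_id: String },
       TaskUpdating { executor_id: String, task_id: u64 },
   }

   // Process events strictly one at a time, as the event queue does.
   // Once an ExecutorLost event is seen, later TaskUpdating events from
   // that executor are dropped rather than applied.
   fn process_events(events: Vec<SchedulerEvent>) -> Vec<u64> {
       let mut lost: HashSet<String> = HashSet::new();
       let mut applied = Vec::new();
       for event in events {
           match event {
               SchedulerEvent::ExecutorLost { executor_id } => {
                   lost.insert(executor_id);
               }
               SchedulerEvent::TaskUpdating { executor_id, task_id } => {
                   if lost.contains(&executor_id) {
                       continue; // stale update from a lost executor
                   }
                   applied.push(task_id);
               }
           }
       }
       applied
   }
   ```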
   
   The scenario I am concerned about is a task update coming from a different, 
still-alive executor. So the issue wouldn't be interleaving, but rather that we 
may treat a valid status update as an error because it applies to a stage that 
is no longer running (since it has been rolled back). I think we need to make 
sure that scenario is handled in the task status update and that we ignore 
such updates instead of erroring.
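
   A minimal sketch of the behavior I'm suggesting for `TaskManager::update_task_status`, using a hypothetical stage-state map (the real `ExecutionStage` states differ): an update targeting a stage that is no longer running is ignored rather than failing the whole batch.

   ```rust
   use std::collections::HashMap;

   // Hypothetical stage states for illustration only.
   #[derive(Clone, Copy, PartialEq, Debug)]
   enum StageState {
       Running,
       RolledBack,
       Completed,
   }

   // Returns Ok(true) if the update should be applied, Ok(false) if it is
   // a stale-but-valid update to ignore, and Err only for genuinely
   // invalid input. A rolled-back stage no longer fails the batch.
   fn apply_update(
       stages: &HashMap<u32, StageState>,
       stage_id: u32,
   ) -> Result<bool, String> {
       match stages.get(&stage_id) {
           Some(StageState::Running) => Ok(true),
           Some(_) => Ok(false), // stage rolled back or done: ignore
           None => Err(format!("unknown stage {stage_id}")),
       }
   }
   ```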
   
   Thinking about this a bit more, I think it would be a good idea to make the 
failure handling configurable. Rolling back and recomputing results may not 
always be desired. In an interactive query system, for example, it may be 
preferable to simply fail queries when an executor is lost rather than consume 
more cluster resources recomputing results. Not for this PR of course, but 
what do you think?
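
   Something like the following (entirely hypothetical; Ballista exposes no such setting today):

   ```rust
   // Hypothetical configuration knob for executor-loss handling.
   #[derive(Clone, Copy, PartialEq, Debug)]
   enum ExecutorLostPolicy {
       // Roll back affected stages and recompute lost partitions.
       RollbackAndRecompute,
       // Fail any job with tasks on the lost executor, e.g. for
       // latency-sensitive interactive systems.
       FailAffectedJobs,
   }

   // Dispatch on the policy; returns the action taken per affected job.
   fn on_executor_lost(policy: ExecutorLostPolicy, affected_jobs: &[&str]) -> Vec<String> {
       affected_jobs
           .iter()
           .map(|job| match policy {
               ExecutorLostPolicy::RollbackAndRecompute => format!("rollback {job}"),
               ExecutorLostPolicy::FailAffectedJobs => format!("fail {job}"),
           })
           .collect()
   }
   ```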
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
