mingmwang commented on PR #184:
URL: https://github.com/apache/arrow-ballista/pull/184#issuecomment-1237212842

   > > > Had a chance to review more carefully, and I'm not sure I understand how this will interact with task statuses that are arriving from the executors. I haven't traced all the way through the logic, but from what I can tell, if a stage is rolled back due to an executor being lost and a task status comes in after the rollback (or during the rollback but queued waiting for the write lock on the job), then it will error out since the stage is no longer running. But this will cause all task updates in that batch to fail. This is not an error we currently handle, so it can cause cascading issues (including stalling other jobs entirely, since their task updates are effectively dropped).
   > > 
   > > 
   > > Good catch. There is a gap here; I will take a look and fix it. In this PR, the executor-lost handling and task status updates do not interleave with each other, so there should be no lock contention. All updates to the `TaskManager` can only be triggered through `QueryStageSchedulerEvent`s. The events are queued and processed one by one, so either `QueryStageSchedulerEvent::ExecutorLost` happens first or `QueryStageSchedulerEvent::TaskUpdating` happens first. If the executor-lost event happens first, the `QueryStageSchedulerEvent::TaskUpdating` from that executor should be ignored. I already have a check in the task_status RPC call, but there is still a gap in `TaskManager::update_task_status`; I will take a look.
   > 
   > The scenario I am concerned about is a task update coming from a different, still-alive executor. So the issue wouldn't be interleaving, but rather that we may treat a valid status update as an error because it applies to a stage that is no longer running (since it has been rolled back). I think we need to make sure that scenario is handled in the task status update and that such updates are ignored instead of causing an error.
   > 
   > Thinking about this a bit more, I think it would be a good idea to make the failure handling configurable. It may not be desirable to roll back and recompute results. In an interactive query system, for example, it may be preferable to simply fail queries when an executor is lost rather than consume more cluster resources attempting to recompute results. Not for this PR of course, but what do you think?
   
   Agreed, we should make it configurable. Fast-failing queries does make sense in some scenarios. In the future we will definitely re-implement the 'All-At-Once' scheduling, and it should fast-fail the queries.
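
   To make the event-queue model described above concrete, here is a minimal sketch of events being drained and handled one at a time. The type and variant names mirror `QueryStageSchedulerEvent` but are hypothetical stand-ins, not the actual Ballista scheduler code:

   ```rust
   use std::collections::VecDeque;

   // Hypothetical stand-ins for the QueryStageSchedulerEvent variants
   // discussed above; not the real Ballista types.
   enum SchedulerEvent {
       TaskUpdating { executor_id: String },
       ExecutorLost { executor_id: String },
   }

   fn main() {
       // Events are queued and drained one at a time, so ExecutorLost and
       // TaskUpdating never interleave: whichever was enqueued first wins.
       let mut queue: VecDeque<SchedulerEvent> = VecDeque::new();
       queue.push_back(SchedulerEvent::ExecutorLost { executor_id: "exec-1".into() });
       queue.push_back(SchedulerEvent::TaskUpdating { executor_id: "exec-1".into() });

       let mut lost_executors: Vec<String> = Vec::new();
       while let Some(event) = queue.pop_front() {
           match event {
               SchedulerEvent::ExecutorLost { executor_id } => {
                   // Remember the lost executor so later updates from it are dropped.
                   lost_executors.push(executor_id);
               }
               SchedulerEvent::TaskUpdating { executor_id } => {
                   if lost_executors.contains(&executor_id) {
                       // ExecutorLost was processed first: ignore this update.
                       println!("ignoring task update from lost executor {executor_id}");
                   } else {
                       println!("applying task update from executor {executor_id}");
                   }
               }
           }
       }
   }
   ```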
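
   For the stale-update scenario (a valid update from a still-alive executor that targets a rolled-back stage), the handling could look roughly like the following. `StageState` and `update_task_statuses` are illustrative names, not the real `TaskManager::update_task_status` signature:

   ```rust
   use std::collections::HashMap;

   // Hypothetical per-stage state; only the running flag matters here.
   struct StageState {
       running: bool,
   }

   /// Apply a batch of (stage_id, status) updates. Updates that target a
   /// stage that is no longer running (e.g. rolled back after an executor
   /// loss) are skipped instead of failing the whole batch, so one stale
   /// update cannot stall unrelated jobs.
   fn update_task_statuses(stages: &HashMap<usize, StageState>, updates: &[(usize, &str)]) {
       for (stage_id, status) in updates {
           match stages.get(stage_id) {
               Some(stage) if stage.running => {
                   println!("stage {stage_id}: applied status '{status}'");
               }
               _ => {
                   println!("stage {stage_id}: ignored stale status '{status}'");
               }
           }
       }
   }

   fn main() {
       let mut stages = HashMap::new();
       stages.insert(1, StageState { running: true });
       stages.insert(2, StageState { running: false }); // rolled back

       // The stale update for stage 2 is dropped; stage 1 still succeeds.
       update_task_statuses(&stages, &[(1, "completed"), (2, "completed")]);
   }
   ```

   The key point is that a stale update is skipped per entry, so a single rolled-back stage cannot fail the whole batch of updates it arrived with.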
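
   And for the configurability point, a sketch of what an executor-loss policy knob could look like; the enum and handler are hypothetical, not an existing Ballista config:

   ```rust
   // Hypothetical config knob: what the scheduler does when an executor is lost.
   #[derive(Clone, Copy)]
   enum ExecutorLostPolicy {
       // Roll back affected stages and recompute the lost partitions.
       RollbackAndRetry,
       // Fail affected jobs immediately, e.g. for interactive query systems
       // where recomputation would waste cluster resources.
       FailFast,
   }

   fn on_executor_lost(policy: ExecutorLostPolicy, job_id: &str) {
       match policy {
           ExecutorLostPolicy::RollbackAndRetry => {
               println!("job {job_id}: rolling back stages and resubmitting tasks");
           }
           ExecutorLostPolicy::FailFast => {
               println!("job {job_id}: failing fast without recomputation");
           }
       }
   }

   fn main() {
       // An 'All-At-Once' style scheduler would likely default to FailFast.
       on_executor_lost(ExecutorLostPolicy::FailFast, "job-42");
       on_executor_lost(ExecutorLostPolicy::RollbackAndRetry, "job-42");
   }
   ```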

