mingmwang commented on PR #184: URL: https://github.com/apache/arrow-ballista/pull/184#issuecomment-1237212842
> > > Had a chance to review more carefully and I'm not sure I understand how this will interact with task statuses that are arriving from the executors. I haven't traced all the way through the logic, but from what I can tell, if a stage is rolled back due to an executor being lost and a task status comes in after the rollback (or during the rollback, but queued waiting for the write lock on the job), then it will error out since the stage is no longer running. But this will cause all task updates in that batch to fail. This is not an error we currently handle, so it can cause cascading issues (including causing other jobs to stall out entirely, since task updates are effectively dropped completely).
> >
> > Good catch. There is a gap here; I will take a look and fix it. In this PR, the executor-lost handling and task status updates will not interleave with each other, so there should be no lock contention. All updates to the `TaskManager` can only be triggered through `QueryStageSchedulerEvent`s. The events are queued and processed one by one: either `QueryStageSchedulerEvent::ExecutorLost` happens first or `QueryStageSchedulerEvent::TaskUpdating` happens first. If the executor-lost event happens first, the `QueryStageSchedulerEvent::TaskUpdating` events from that executor should be ignored. I already have a check in the task_status RPC call, but there is a gap in `TaskManager::update_task_status`; I will take a look.
>
> The scenario I am concerned about is a task update coming from a different, still-alive executor. So the issue wouldn't be interleaving, but rather that we may treat a valid status update as an error because it applies to a stage that is no longer running (since it has been rolled back). I think we need to make sure that scenario is handled in the task status update and that we ignore such updates instead of erroring.
>
> Thinking about this a bit more, I think it would be a good idea to make the failure handling configurable. It may not be desired to roll back and recompute results. In an interactive query system, for example, it may be desirable to just fail queries when an executor is lost rather than consume more cluster resources attempting to recompute results. Not for this PR of course, but what do you think?

Agree, we should make it configurable. Fast-failing queries does make sense in some scenarios. In the future we will definitely re-implement the 'All-At-Once' scheduling, and it should fail queries fast.
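For context on the event serialization described above, here is a rough sketch (hypothetical types and a simplified loop, not the actual Ballista `QueryStageScheduler`) of how draining events with a single consumer keeps `ExecutorLost` and `TaskUpdating` from running concurrently, and how updates from an already-lost executor can be ignored:

```rust
// Hypothetical sketch of single-consumer event processing; the real
// QueryStageScheduler event loop in Ballista is more involved.
use std::collections::HashSet;
use std::sync::mpsc;

enum QueryStageSchedulerEvent {
    ExecutorLost { executor_id: String },
    TaskUpdating { executor_id: String /* , statuses: ... */ },
}

fn run_event_loop(rx: mpsc::Receiver<QueryStageSchedulerEvent>) {
    let mut lost_executors: HashSet<String> = HashSet::new();
    // A single consumer drains the queue, so ExecutorLost and
    // TaskUpdating are never handled concurrently: whichever was
    // enqueued first is fully processed first.
    for event in rx {
        match event {
            QueryStageSchedulerEvent::ExecutorLost { executor_id } => {
                lost_executors.insert(executor_id);
                // ... roll back stages with partitions on that executor
            }
            QueryStageSchedulerEvent::TaskUpdating { executor_id } => {
                if lost_executors.contains(&executor_id) {
                    // Updates from an executor already marked lost are
                    // ignored rather than treated as errors.
                    continue;
                }
                // ... apply the task status updates
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(QueryStageSchedulerEvent::ExecutorLost { executor_id: "exec-1".into() }).unwrap();
    tx.send(QueryStageSchedulerEvent::TaskUpdating { executor_id: "exec-1".into() }).unwrap();
    drop(tx); // close the channel so the loop terminates
    run_event_loop(rx);
}
```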
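For the rolled-back-stage scenario, a minimal sketch of the ignore-instead-of-error behavior (again with hypothetical names; the real `TaskManager::update_task_status` is more involved) might look like:

```rust
// Hypothetical sketch of the fix discussed above: a stale update for
// a stage that is no longer running is skipped, not treated as an
// error, so it cannot fail the rest of the batch.
use std::collections::HashMap;

enum StageState {
    Running,
    // Rolled back after an executor was lost; tasks will be re-run.
    Resolved,
}

struct TaskStatus {
    stage_id: usize,
    task_id: usize,
    // ... executor id, task state, metrics, etc.
}

struct Job {
    stages: HashMap<usize, StageState>,
}

impl Job {
    /// Apply a batch of task status updates. An update that targets a
    /// stage which is no longer running (e.g. it was rolled back) is
    /// skipped instead of failing the whole batch, so one stale update
    /// cannot drop the remaining updates.
    fn update_task_statuses(&mut self, statuses: Vec<TaskStatus>) {
        for status in statuses {
            match self.stages.get(&status.stage_id) {
                Some(StageState::Running) => {
                    println!("applied update for task {}", status.task_id);
                }
                _ => {
                    println!(
                        "ignored stale update for task {} in stage {}",
                        status.task_id, status.stage_id
                    );
                }
            }
        }
    }
}

fn main() {
    let mut job = Job {
        stages: HashMap::from([(1, StageState::Running), (2, StageState::Resolved)]),
    };
    job.update_task_statuses(vec![
        TaskStatus { stage_id: 1, task_id: 10 },
        TaskStatus { stage_id: 2, task_id: 20 }, // stage was rolled back
    ]);
}
```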
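And a sketch of what a configurable failure-handling policy could look like, assuming a hypothetical `ExecutorLostPolicy` setting (not an existing Ballista config option):

```rust
// Hypothetical configuration knob; the names are illustrative, not
// Ballista's actual configuration API.
#[derive(Clone, Copy, Debug)]
enum ExecutorLostPolicy {
    /// Roll back affected stages and recompute the lost partitions
    /// (the behavior implemented in this PR).
    RollbackAndRecompute,
    /// Fail the affected jobs immediately; useful for interactive
    /// query systems where recomputation wastes cluster resources.
    FailFast,
}

fn on_executor_lost(policy: ExecutorLostPolicy, job_id: &str) {
    match policy {
        ExecutorLostPolicy::RollbackAndRecompute => {
            println!("rolling back stages of job {job_id} for re-execution");
        }
        ExecutorLostPolicy::FailFast => {
            println!("failing job {job_id} immediately");
        }
    }
}

fn main() {
    on_executor_lost(ExecutorLostPolicy::FailFast, "job-42");
}
```

An interactive deployment could then opt into `FailFast`, while long-running batch workloads keep `RollbackAndRecompute`.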
