thinkharderdev commented on a change in pull request #1983:
URL: https://github.com/apache/arrow-datafusion/pull/1983#discussion_r827714678



##########
File path: ballista/rust/scheduler/src/scheduler_server/event_loop.rs
##########
@@ -107,14 +125,19 @@ impl<T: 'static + AsLogicalPlan, U: 'static + 
AsExecutionPlan>
                             }
                         })
                         .collect::<Vec<String>>(),
-                    executor_data.executor_id
+                    executor_delta_data.executor_id
                 );
                 let mut client = {
                     let clients = self.executors_client.read().await;
-                    clients.get(&executor_data.executor_id).unwrap().clone()
+                    clients
+                        .get(&executor_delta_data.executor_id)
+                        .unwrap()
+                        .clone()
                 };
                 // Update the resources first
-                self.state.save_executor_data(executor_data.clone());
+                self.state
+                    .executor_manager
+                    .update_executor_data(executor_delta_data);
                 // TODO check whether launching task is successful or not
                 client.launch_task(LaunchTaskParams { task: tasks }).await?;

Review comment:
       It was just a thought, not sure if the performance cost would be 
justifies. But the idea is that updating the executor state and launching the 
tasks should in principle be an atomic transaction. If you update first and 
then the launching fails, you resources will never be released but if you 
launch first and then update state then you have a concurrency issue where 
another thread may try and launch tasks after the task are launched and before 
the state is updated so you end up overcommitted. 
   
   The second issue seems less of a problem because it seems pretty 
straightforward to just have the executor reject the task if it has no 
available task slots, so it may be enough to just reverse the order here 
(launch tasks -> update state) and let the existing error handling mechanisms 
deal with it. Since launching requires a network call, holding a write lock on 
the entire state for that operation could be a major performance bottleneck, so 
if we did need a lock it would probably have to be at the individual executor 
level. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to