[GitHub] [arrow-ballista] mingmwang commented on a diff in pull request #261: Task level retry and Stage level retry

GitBox Sat, 01 Oct 2022 20:58:27 -0700


mingmwang commented on code in PR #261:
URL: https://github.com/apache/arrow-ballista/pull/261#discussion_r985174589



##########
ballista/rust/core/proto/ballista.proto:
##########
@@ -664,15 +698,48 @@ message RunningTask {
 
 message FailedTask {
   string error = 1;
+  bool retryable = 2;
+  // Whether this task failure should be counted to the maximum number of times the task is allowed to retry
+  bool count_to_failures = 3;
+  oneof failed_reason {
+    ExecutionError execution_error = 4;
+    FetchPartitionError fetch_partition_error = 5;
+    IOError io_error = 6;
+    ExecutorLost executor_lost = 7;
+    // A successful task's result is lost due to executor lost
+    ResultLost result_lost = 8;
+    TaskKilled task_killed = 9;
+  }
 }
 
-message CompletedTask {
+message SuccessfulTask {
   string executor_id = 1;
   // TODO tasks are currently always shuffle writes but this will not always be the case
   // so we might want to think about some refactoring of the task definitions
   repeated ShuffleWritePartition partitions = 2;
 }
 
+message ExecutionError {
+}
+
+message FetchPartitionError {
+  string executor_id = 1;
+  uint32 map_stage_id = 2;
+  uint32 map_partition_id = 3;
+}
+
+message IOError {
+}
+
+message ExecutorLost {
+}
+
+message ResultLost {
+}

Review Comment:
   > Not sure I understand the difference between these two errors
   
   Yes, good question. In the current code base, neither of the two errors is 
used directly by the executor tasks.
   They are used by the Scheduler. When it sees a `FetchPartitionError` task 
update from a reduce task, it changes the related map task's status to 
`ResultLost`. Of course, most of the time a `ResultLost` will be caused by an 
`ExecutorLost`.
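
A minimal sketch of that scheduler-side transition, to make the relationship between the two errors concrete. This is not the code in the PR: the `FailedReason` and `TaskStatus` enums, the `TaskTable` alias, and `on_reduce_task_failed` are hypothetical, simplified stand-ins for the proto messages above, and the check that the map output was produced on the reported executor is an assumption.

```rust
use std::collections::HashMap;

// Hypothetical, simplified mirror of the `failed_reason` oneof in the diff.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum FailedReason {
    FetchPartitionError {
        executor_id: String,
        map_stage_id: u32,
        map_partition_id: u32,
    },
    ResultLost,
    ExecutorLost,
}

// Hypothetical task status as tracked by the scheduler.
#[derive(Debug, Clone, PartialEq)]
enum TaskStatus {
    Successful { executor_id: String },
    Failed(FailedReason),
}

// Task table keyed by (stage_id, partition_id); an assumption for this sketch.
type TaskTable = HashMap<(u32, u32), TaskStatus>;

/// Handles a failed reduce-task update on the scheduler side.
/// Returns true if a map task was moved from Successful to ResultLost.
fn on_reduce_task_failed(tasks: &mut TaskTable, reason: &FailedReason) -> bool {
    if let FailedReason::FetchPartitionError {
        executor_id,
        map_stage_id,
        map_partition_id,
    } = reason
    {
        let key = (*map_stage_id, *map_partition_id);
        if let Some(status) = tasks.get_mut(&key) {
            // Assumption: only a successful map task whose output lived on the
            // reported executor is downgraded.
            let produced_here = matches!(
                &*status,
                TaskStatus::Successful { executor_id: e } if e == executor_id
            );
            if produced_here {
                *status = TaskStatus::Failed(FailedReason::ResultLost);
                return true;
            }
        }
    }
    // Any other failure reason leaves the map-side task table untouched.
    false
}
```

Under these assumptions, a `FetchPartitionError` reported for map stage 1, partition 0 on a given executor flips that map task from `Successful` to `Failed(ResultLost)`, so it can be rescheduled; a second report for the same partition is a no-op.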
   



