mingmwang commented on code in PR #261:
URL: https://github.com/apache/arrow-ballista/pull/261#discussion_r985174589
##########
ballista/rust/core/proto/ballista.proto:
##########
@@ -664,15 +698,48 @@ message RunningTask {
message FailedTask {
string error = 1;
+ bool retryable = 2;
+ // Whether this task failure should be counted to the maximum number of times the task is allowed to retry
+ bool count_to_failures = 3;
+ oneof failed_reason {
+ ExecutionError execution_error = 4;
+ FetchPartitionError fetch_partition_error = 5;
+ IOError io_error = 6;
+ ExecutorLost executor_lost = 7;
+ // A successful task's result is lost due to executor lost
+ ResultLost result_lost = 8;
+ TaskKilled task_killed = 9;
+ }
}
-message CompletedTask {
+message SuccessfulTask {
string executor_id = 1;
// TODO tasks are currently always shuffle writes but this will not always be the case
// so we might want to think about some refactoring of the task definitions
repeated ShuffleWritePartition partitions = 2;
}
+message ExecutionError {
+}
+
+message FetchPartitionError {
+ string executor_id = 1;
+ uint32 map_stage_id = 2;
+ uint32 map_partition_id = 3;
+}
+
+message IOError {
+}
+
+message ExecutorLost {
+}
+
+message ResultLost {
+}
Review Comment:
> Not sure I understand the difference between these two errors
Yes, good question. In the current code base, neither of the two errors
is used directly by the executor tasks.
Both are used by the Scheduler. When the Scheduler sees a
`FetchPartitionError` task update from a reduce task, it changes the
related map task's status to `ResultLost`. Of course, most of the time
`ResultLost` should be caused by `ExecutorLost`.
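
To make the flow above concrete, here is a minimal sketch of how a scheduler could turn a `FetchPartitionError` reported by a reduce task into a `ResultLost` status on the map task. This is not the code in this PR: `SchedulerState`, `TaskStatus`, `FailedReason` as a Rust enum, and `on_task_failed` are hypothetical names made up for illustration, and the check that the map output lived on the reported executor is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified mirror of the `failed_reason` oneof in the diff.
#[derive(Debug, Clone, PartialEq)]
enum FailedReason {
    ExecutionError,
    FetchPartitionError {
        executor_id: String,
        map_stage_id: u32,
        map_partition_id: u32,
    },
    ResultLost,
}

/// Hypothetical task status as tracked by the scheduler.
#[derive(Debug, Clone, PartialEq)]
enum TaskStatus {
    Successful { executor_id: String },
    Failed(FailedReason),
}

/// Hypothetical scheduler state: (stage_id, partition_id) -> task status.
#[derive(Default)]
struct SchedulerState {
    tasks: HashMap<(u32, u32), TaskStatus>,
}

impl SchedulerState {
    /// Handles a failed-task update. Returns true if a map task was
    /// flipped from `Successful` to `ResultLost` as a consequence.
    fn on_task_failed(&mut self, reason: &FailedReason) -> bool {
        // Only a fetch failure from a reduce task points back at a map task.
        if let FailedReason::FetchPartitionError {
            executor_id,
            map_stage_id,
            map_partition_id,
        } = reason
        {
            let key = (*map_stage_id, *map_partition_id);
            // Assumption: only a successful map task whose output lived on
            // the reported executor is considered lost.
            let lost = match self.tasks.get(&key) {
                Some(TaskStatus::Successful { executor_id: e }) => e == executor_id,
                _ => false,
            };
            if lost {
                self.tasks
                    .insert(key, TaskStatus::Failed(FailedReason::ResultLost));
            }
            return lost;
        }
        false
    }
}
```

In this sketch the executor never reports `ResultLost` itself; the status only appears on the scheduler side, which matches the description above.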