kevin85421 commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1205832942
Hi @mridulm,
# Task Failure vs Non-Task Failure
Executor loss reasons fall into two categories.
(1) Task Failure: The network is fine, but a task crashes the executor's JVM, so the
executor is lost. In the example below, the task kills the executor's JVM with
`System.exit(0)`. If an executor loss is caused by a Task Failure, we should increment
`numFailures`; once `numFailures` exceeds the threshold, Spark marks the job as failed.
```scala
sc.parallelize(1 to 10, 1).map { x => System.exit(0) }.count()
```
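For context, here is a minimal, self-contained sketch of the failure-counting idea. The class and method names are hypothetical, not Spark's actual `TaskSetManager` internals; the threshold corresponds to the real `spark.task.maxFailures` setting (default 4).
```scala
import scala.collection.mutable

// Hypothetical sketch of per-task failure counting; illustrative only.
class FailureCounter(maxFailures: Int = 4 /* spark.task.maxFailures */) {
  // taskIndex -> number of failed attempts so far
  private val numFailures = mutable.Map.empty[Int, Int].withDefaultValue(0)

  /** Record one failed attempt; fail the job once the threshold is reached. */
  def onTaskFailed(taskIndex: Int): Unit = {
    numFailures(taskIndex) += 1
    if (numFailures(taskIndex) >= maxFailures) {
      throw new RuntimeException(
        s"Task $taskIndex failed $maxFailures times; marking the job as failed")
    }
  }
}
```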
(2) Non-Task Failure: If an executor loss is not caused by a problematic task as in
the example above, it is a Non-Task Failure. For example:
Example 1: The executor's JVM exits because it fails to create a local directory.
Example 2: The executor itself is healthy, but the network between the driver and
the executor is broken.
In these cases, we should not fail the Spark job; we only need to reassign the lost
tasks to other executors, as sketched below.
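To make the distinction concrete, below is a minimal, self-contained sketch of the categorization. All names are hypothetical and only mirror the idea; Spark's scheduler-internal loss reasons are `private[spark]` and are not reproduced here.
```scala
// Hypothetical model of why an executor was lost; not Spark's internal API.
sealed trait ExecutorLossCause
case object TaskCausedJvmCrash     extends ExecutorLossCause // e.g. System.exit in a task
case object DirectoryCreationError extends ExecutorLossCause // Example 1 above
case object NetworkDisconnection   extends ExecutorLossCause // Example 2 above

/** Only losses caused by a task should count toward numFailures. */
def countsTowardsTaskFailures(cause: ExecutorLossCause): Boolean = cause match {
  case TaskCausedJvmCrash                            => true  // Task Failure
  case DirectoryCreationError | NetworkDisconnection => false // Non-Task Failure: reschedule only
}
```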
# Why do we need this PR?
Currently, the driver treats every executor loss as a Task Failure. However, some
executor losses are caused by Non-Task Failures (e.g. a network disconnection), so
`numFailures` is incremented for them as well and some Spark jobs fail spuriously.
With this PR, the reason for an executor loss is categorized as either a Task
Failure or a Non-Task Failure, so those spurious job failures can be avoided.