kevin85421 commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1205832942

   Hi @mridulm,
   
   # Task Failure vs Non-Task Failure
   We can categorize executor loss reasons into two categories.
   
   (1) Task Failure: The network is fine, but a task causes the executor's JVM to crash, so the executor is lost. For example, the task below terminates the executor's JVM with `System.exit(0)`. If the executor loss is caused by a Task Failure, we should increment the variable `numFailures`. If `numFailures` exceeds a threshold, Spark will mark the job as failed.
   ```scala
   sc.parallelize(1 to 10, 1).map { x => System.exit(0) }.count()
   ```
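   The threshold mentioned above is controlled by the `spark.task.maxFailures` configuration (4 by default). A minimal sketch of the driver-side bookkeeping, using hypothetical names (`TaskFailureBookkeeping` and `recordFailure` are not real Spark APIs):
   ```scala
   // Hypothetical sketch of the per-task failure bookkeeping described above.
   // maxFailures mirrors spark.task.maxFailures (default 4); once a task has
   // failed that many times, the driver aborts the whole job.
   object TaskFailureBookkeeping {
     val maxFailures: Int = 4

     // Returns the updated failure count and whether the job should be failed.
     def recordFailure(numFailures: Int): (Int, Boolean) = {
       val updated = numFailures + 1
       (updated, updated >= maxFailures)
     }
   }
   ```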
   
   (2) Non-Task Failure: If the executor loss is not caused by problematic tasks as in the example above, it is caused by a Non-Task Failure. For example:
   
   Example 1: The executor's JVM shuts down because it fails to create a directory.
   Example 2: The executor is working well, but the network between the Driver and the Executor is broken.
   
   In these cases, we should not fail the Spark job. We just need to reassign the tasks to other nodes.
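   The distinction between the two categories can be sketched as a small classifier. This is an illustrative model only, not the PR's actual implementation (the real code works with Spark's internal `ExecutorLossReason` hierarchy; the case objects below are hypothetical):
   ```scala
   // Hypothetical loss reasons mirroring the two categories above.
   sealed trait LossReason
   case object TaskCausedJvmCrash extends LossReason      // category (1): Task Failure
   case object DirectoryCreationFailed extends LossReason // Example 1
   case object NetworkBroken extends LossReason           // Example 2

   // Only Task Failures should count towards numFailures; Non-Task Failures
   // should just cause the tasks to be rescheduled on other nodes.
   def countTowardsTaskFailures(reason: LossReason): Boolean = reason match {
     case TaskCausedJvmCrash => true
     case DirectoryCreationFailed | NetworkBroken => false
   }
   ```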
   
   # Why do we need this PR?
   Currently, the driver treats every executor loss as a Task Failure. However, some executor losses are caused by Non-Task Failures (e.g. a network disconnection), and thus some Spark jobs fail unnecessarily. With this PR, we categorize the reason for an executor loss as either Task Failure or Non-Task Failure, so these unnecessary job failures can be avoided.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

