GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/21243
[SPARK-24182][yarn] Improve error message when client AM fails.
Instead of always throwing a generic exception when the AM fails,
print a generic error and throw an exception whose message includes
the YARN diagnostics, which contain the reason for the failure.
There was an issue with YARN sometimes providing a generic diagnostic
message, even though the AM provides a failure reason when
unregistering. That was happening because the AM was registering
too late, and if errors happened before the registration, YARN would
just create a generic "ExitCodeException", which wasn't very helpful.
Since most errors in this path are a result of not being able to
connect to the driver, this change modifies the AM registration
a bit so that the AM is registered before the connection to the
driver is established. That way, errors are properly propagated
through YARN back to the driver.
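For illustration, here's a rough sketch of the new ordering using Hadoop's
AMRMClient API directly (this is not Spark's actual ApplicationMaster code,
and connectToDriver is a hypothetical stand-in for the RPC setup):
```scala
// Illustrative sketch of the new ordering, not Spark's actual ApplicationMaster
// code. The point: register with YARN *before* reaching out to the driver, so
// a connection failure can be reported through the unregister diagnostics.
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val amClient: AMRMClient[AMRMClient.ContainerRequest] = AMRMClient.createAMRMClient()
amClient.init(new YarnConfiguration())
amClient.start()

// Register first (host/port/tracking URL kept minimal for the sketch).
amClient.registerApplicationMaster("localhost", 0, "")

// Hypothetical stand-in for setting up the RPC connection to the driver.
def connectToDriver(): Unit = { /* ... */ }

try {
  connectToDriver()
} catch {
  case e: Exception =>
    // The failure reason now reaches the driver via the YARN report.
    amClient.unregisterApplicationMaster(
      FinalApplicationStatus.FAILED, s"Uncaught exception: $e", "")
    throw e
}
```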
As part of that, I also removed the code that retried connections
to the driver from the client AM. At that point, the driver should
already be up and waiting for connections, so it's unlikely that
retrying would help; and if it did, that would point to a flaky
network, where problems would probably show up again anyway.
The effect of that is that connection-related errors are reported
back to the driver much faster now (through the YARN report).
One thing to note is that there seems to be a race on the YARN
side that causes a report to be sent to the client without the
corresponding diagnostics string from the AM; the diagnostics are
available later from the RM web page. For that reason, the generic
error messages are kept in the Spark scheduler code, to help
guide users to a way of debugging their failure.
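To make the scheduler-side behavior concrete, here's a hedged sketch using
Hadoop's YarnClient API (illustrative only, not the actual
YarnClientSchedulerBackend code); it logs the generic message first and
appends the diagnostics only when the RM report actually carries them:
```scala
// Illustrative sketch: report a generic error on FAILED state, and add the
// YARN diagnostics when present, since the racy report may arrive without them.
import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

def reportFailure(appId: ApplicationId): Unit = {
  val yarnClient = YarnClient.createYarnClient()
  yarnClient.init(new YarnConfiguration())
  yarnClient.start()

  val report = yarnClient.getApplicationReport(appId)
  if (report.getYarnApplicationState == YarnApplicationState.FAILED) {
    // Generic message kept on the Spark side, in case diagnostics are missing.
    println("YARN application has exited unexpectedly with state FAILED! " +
      "Check the YARN application logs for more details.")
    val diags = report.getDiagnostics
    if (diags != null && diags.nonEmpty) {
      println(s"Diagnostics message: $diags")
    }
  }
  yarnClient.stop()
}
```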
Also of note is that if YARN's max attempts configuration is lower
than Spark's, Spark will not unregister the AM with a proper
diagnostics message. Unfortunately there seems to be no way to
unregister the AM and still allow further re-attempts to happen.
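For context, a hedged sketch of how an AM can check whether the current
attempt is the last one YARN will allow (illustrative only; Spark's own limit
is configured separately via spark.yarn.maxAppAttempts):
```scala
// Illustrative sketch: unregistering finalizes the application and prevents
// further re-attempts, so an AM would only want to do it on the last attempt.
import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.records.ContainerId
import org.apache.hadoop.yarn.conf.YarnConfiguration

def isLastAttempt(yarnConf: YarnConfiguration): Boolean = {
  // The AM learns its attempt id from the container id in its environment
  // (ContainerId.fromString requires Hadoop 2.8+).
  val containerId = ContainerId.fromString(
    System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name()))
  val attemptId = containerId.getApplicationAttemptId.getAttemptId
  val maxAttempts = yarnConf.getInt(
    YarnConfiguration.RM_AM_MAX_ATTEMPTS,
    YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS)
  attemptId >= maxAttempts
}
```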
Testing:
- existing unit tests
- some of our integration tests
- hardcoded an invalid driver address in the code and verified
the error in the shell, e.g.:
```
scala> 18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: YARN
application has exited unexpectedly with state FAILED! Check the YARN
application logs for more details.
18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: Diagnostics
message: Uncaught exception: org.apache.spark.SparkException: Exception thrown
in awaitResult:
<AM stack trace>
Caused by: java.io.IOException: Failed to connect to
localhost/127.0.0.1:1234
<More stack trace>
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-24182
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21243.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21243
----
commit a8c223df3aaf4a0ad0905494cdc21c11c097392b
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-05-04T18:07:30Z
[SPARK-24182][yarn] Improve error message when client AM fails.
----