GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/21243
[SPARK-24182][yarn] Improve error message when client AM fails.
Instead of always throwing a generic exception when the AM fails,
print a generic error and throw an exception whose message includes
the YARN diagnostics, which contain the reason for the failure.
There was an issue with YARN sometimes providing a generic diagnostic
message, even though the AM provides a failure reason when
unregistering. That was happening because the AM was registering
too late, and if errors happened before the registration, YARN would
just create a generic "ExitCodeException", which wasn't very helpful.
Since most errors in this path are a result of not being able to
connect to the driver, this change modifies the AM registration
a bit so that the AM is registered before the connection to the
driver is established. That way, errors are properly propagated
through YARN back to the driver.
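For illustration, here's a rough sketch of the new ordering using Hadoop's
AMRMClient API directly (this is not Spark's actual ApplicationMaster code,
and connectToDriver is a hypothetical stand-in for the RPC setup):
```scala
// Illustrative sketch of the new ordering, not Spark's actual ApplicationMaster
// code. The point: register with YARN *before* reaching out to the driver, so
// a connection failure can be reported through the unregister diagnostics.
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val amClient: AMRMClient[AMRMClient.ContainerRequest] = AMRMClient.createAMRMClient()
amClient.init(new YarnConfiguration())
amClient.start()

// Register first (host/port/tracking URL kept minimal for the sketch).
amClient.registerApplicationMaster("localhost", 0, "")

// Hypothetical stand-in for setting up the RPC connection to the driver.
def connectToDriver(): Unit = { /* ... */ }

try {
  connectToDriver()
} catch {
  case e: Exception =>
    // The failure reason now reaches the driver via the YARN report.
    amClient.unregisterApplicationMaster(
      FinalApplicationStatus.FAILED, s"Uncaught exception: $e", "")
    throw e
}
```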
As part of that, I also removed the code that retried connections
to the driver from the client AM. At that point, the driver should
already be up and waiting for connections, so it's unlikely that
retrying would help; and if it did, that would point to a flaky
network, where problems would probably show up again anyway.
The effect of that is that connection-related errors are reported
back to the driver much faster now (through the YARN report).
One thing to note is that there seems to be a race on the YARN
side that causes a report to be sent to the client without the
corresponding diagnostics string from the AM; the diagnostics are
available later from the RM web page. For that reason, the generic
error messages are kept in the Spark scheduler code, to help
guide users to a way of debugging their failure.
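To make the scheduler-side behavior concrete, here's a hedged sketch using
Hadoop's YarnClient API (illustrative only, not the actual
YarnClientSchedulerBackend code); it logs the generic message first and
appends the diagnostics only when the RM report actually carries them:
```scala
// Illustrative sketch: report a generic error on FAILED state, and add the
// YARN diagnostics when present, since the racy report may arrive without them.
import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

def reportFailure(appId: ApplicationId): Unit = {
  val yarnClient = YarnClient.createYarnClient()
  yarnClient.init(new YarnConfiguration())
  yarnClient.start()

  val report = yarnClient.getApplicationReport(appId)
  if (report.getYarnApplicationState == YarnApplicationState.FAILED) {
    // Generic message kept on the Spark side, in case diagnostics are missing.
    println("YARN application has exited unexpectedly with state FAILED! " +
      "Check the YARN application logs for more details.")
    val diags = report.getDiagnostics
    if (diags != null && diags.nonEmpty) {
      println(s"Diagnostics message: $diags")
    }
  }
  yarnClient.stop()
}
```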
Also of note is that if YARN's max attempts configuration is lower
than Spark's, Spark will not unregister the AM with a proper
diagnostics message. Unfortunately there seems to be no way to
unregister the AM and still allow further re-attempts to happen.
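For context, a hedged sketch of how an AM can check whether the current
attempt is the last one YARN will allow (illustrative only; Spark's own limit
is configured separately via spark.yarn.maxAppAttempts):
```scala
// Illustrative sketch: unregistering finalizes the application and prevents
// further re-attempts, so an AM would only want to do it on the last attempt.
import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.records.ContainerId
import org.apache.hadoop.yarn.conf.YarnConfiguration

def isLastAttempt(yarnConf: YarnConfiguration): Boolean = {
  // The AM learns its attempt id from the container id in its environment
  // (ContainerId.fromString requires Hadoop 2.8+).
  val containerId = ContainerId.fromString(
    System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name()))
  val attemptId = containerId.getApplicationAttemptId.getAttemptId
  val maxAttempts = yarnConf.getInt(
    YarnConfiguration.RM_AM_MAX_ATTEMPTS,
    YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS)
  attemptId >= maxAttempts
}
```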
Testing:
- existing unit tests
- some of our integration tests
- hardcoded an invalid driver address in the code and verified
the error in the shell, e.g.:
```
scala> 18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: YARN
application has exited unexpectedly with state FAILED! Check the YARN
application logs for more details.
18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: Diagnostics
message: Uncaught exception: org.apache.spark.SparkException: Exception thrown
in awaitResult:
<AM stack trace>
Caused by: java.io.IOException: Failed to connect to
localhost/127.0.0.1:1234
<More stack trace>
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-24182
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21243.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21243
----
commit a8c223df3aaf4a0ad0905494cdc21c11c097392b
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-05-04T18:07:30Z
[SPARK-24182][yarn] Improve error message when client AM fails.
----