[GitHub] spark pull request #19741: [SPARK-14228][CORE][YARN] Lost executor of RPC di...

devaraj-kavali Mon, 13 Nov 2017 18:48:07 -0800

GitHub user devaraj-kavali opened a pull request:

    https://github.com/apache/spark/pull/19741


    [SPARK-14228][CORE][YARN] Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped

    ## What changes were proposed in this pull request?
    I see two instances where the exception occurs.
    
    **Instance 1:**
    
    ```
    17/11/10 15:49:32 ERROR util.Utils: Uncaught exception in thread driver-revive-thread
    org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
            at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
            at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
            at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
            at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:521)
            at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(CoarseGrainedSchedulerBackend.scala:125)
            at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(CoarseGrainedSchedulerBackend.scala:125)
            at scala.Option.foreach(Option.scala:257)
            at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1$$anonfun$run$1.apply$mcV$sp(CoarseGrainedSchedulerBackend.scala:125)
            at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1344)
            at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1.run(CoarseGrainedSchedulerBackend.scala:124)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    ```
    
    
    In CoarseGrainedSchedulerBackend.scala, driver-revive-thread starts in DriverEndpoint.onStart() and keeps sending ReviveOffers messages periodically until it is shut down as part of DriverEndpoint.onStop(). There is no proper coordination between the shutdown of driver-revive-thread and the unregistering of the RpcEndpoint: the RpcEndpoint is unregistered first, and only then is driver-revive-thread shut down as part of DriverEndpoint.onStop(). In between, driver-revive-thread may try to send a ReviveOffers message, which leads to the above exception.
    
    To fix this issue, this PR moves the shutdown of driver-revive-thread into CoarseGrainedSchedulerBackend.stop(), which executes before the DriverEndpoint is unregistered.
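
The ordering problem can be illustrated with a small standalone Java sketch. It is only a toy model of the race, not Spark code: `ToyDispatcher`, `ReviveShutdownSketch`, and `run` are hypothetical names, the 5 ms period is arbitrary, and the sketch uses `shutdown()` plus `awaitTermination()` so the result is deterministic, which is an assumption and not necessarily what the PR does.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the shutdown race; none of these classes are Spark's.
class ReviveShutdownSketch {
    // Hypothetical stand-in for the RPC dispatcher: tracks registered endpoint names.
    static final class ToyDispatcher {
        private final ConcurrentHashMap<String, Boolean> endpoints = new ConcurrentHashMap<>();

        void register(String name) { endpoints.put(name, Boolean.TRUE); }

        void unregister(String name) { endpoints.remove(name); }

        // Fails the same way the log above does when the endpoint is gone.
        void send(String name, String message) {
            if (!endpoints.containsKey(name)) {
                throw new IllegalStateException("Could not find " + name + ".");
            }
        }
    }

    // Returns how many periodic sends failed because the endpoint was already unregistered.
    static int run(boolean stopReviveThreadFirst) {
        ToyDispatcher dispatcher = new ToyDispatcher();
        dispatcher.register("CoarseGrainedScheduler");
        AtomicInteger failedSends = new AtomicInteger();
        CountDownLatch firstFailure = new CountDownLatch(1);

        // Plays the role of driver-revive-thread: sends ReviveOffers periodically.
        ScheduledExecutorService reviveThread = Executors.newSingleThreadScheduledExecutor();
        reviveThread.scheduleAtFixedRate(() -> {
            try {
                dispatcher.send("CoarseGrainedScheduler", "ReviveOffers");
            } catch (IllegalStateException e) {
                failedSends.incrementAndGet();
                firstFailure.countDown();
            }
        }, 0, 5, TimeUnit.MILLISECONDS);

        try {
            if (stopReviveThreadFirst) {
                // Proposed ordering: stop the sender, then unregister the endpoint.
                reviveThread.shutdown();
                reviveThread.awaitTermination(5, TimeUnit.SECONDS);
                dispatcher.unregister("CoarseGrainedScheduler");
            } else {
                // Original ordering: the endpoint disappears while the sender still runs.
                dispatcher.unregister("CoarseGrainedScheduler");
                firstFailure.await(5, TimeUnit.SECONDS);
                reviveThread.shutdown();
                reviveThread.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            reviveThread.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while stopping", e);
        }
        return failedSends.get();
    }

    public static void main(String[] args) {
        System.out.println("endpoint unregistered first: failed sends = " + run(false));
        System.out.println("revive thread stopped first: failed sends = " + run(true));
    }
}
```

In this model, stopping the periodic sender before unregistering the endpoint leaves no window in which a send can hit a missing endpoint, whereas the opposite order does.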
    
    **Instance 2:**
    
    ```
    17/11/10 16:31:38 ERROR cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Error requesting driver to remove executor 1 for reason Executor for container container_1508535467865_0226_01_000002 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job.
    org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
            at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
            at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
            at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
            at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:516)
            at org.apache.spark.rpc.RpcEndpointRef.ask(RpcEndpointRef.scala:63)
            at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receive$1.applyOrElse(YarnSchedulerBackend.scala:269)
            at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
            at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
            at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
            at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    ```
    
    Here YarnSchedulerEndpoint tries to send remove executor messages after the YARN scheduler backend service has stopped, which leads to the above exception. To avoid the exception, we could either:
    1) Add a condition (which checks whether the service has stopped) before sending the executor remove message, or
    2) Add a warn log message in the onFailure case when the service is already stopped.
    
    This PR takes option 2): it logs a message in the onFailure case, without the exception stack trace, since option 1) would require the check to run for every remove executor message.
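
Option 2) can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the PR's code: `RemoveExecutorFailureSketch`, `requestRemoveExecutor`, the `stopped` flag, and the log wording are all hypothetical, a `CompletableFuture` stands in for the RPC reply, and an in-memory list stands in for the logger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of option 2); names and messages are illustrative, not Spark's.
class RemoveExecutorFailureSketch {
    // Hypothetical flag set once the scheduler backend has been stopped.
    final AtomicBoolean stopped = new AtomicBoolean(false);
    // Stand-in for a logger so the behaviour can be inspected.
    final List<String> log = new ArrayList<>();

    // Attaches the failure handler to the (simulated) reply of a remove-executor request.
    void requestRemoveExecutor(String executorId, CompletableFuture<Boolean> reply) {
        reply.whenComplete((ok, error) -> {
            if (error == null) {
                return; // the driver answered, nothing to report
            }
            if (stopped.get()) {
                // Expected during shutdown: short warning, no stack trace.
                log.add("WARN Attempted to remove executor " + executorId
                        + " after the scheduler backend stopped");
            } else {
                // Unexpected failure while running: keep the full error.
                log.add("ERROR Error requesting driver to remove executor "
                        + executorId + ": " + error);
            }
        });
    }

    // Fails one request before the stop and one after it, and returns the log lines.
    static List<String> demo() {
        RemoveExecutorFailureSketch backend = new RemoveExecutorFailureSketch();

        CompletableFuture<Boolean> beforeStop = new CompletableFuture<>();
        backend.requestRemoveExecutor("1", beforeStop);
        beforeStop.completeExceptionally(
                new IllegalStateException("Could not find CoarseGrainedScheduler."));

        backend.stopped.set(true);
        CompletableFuture<Boolean> afterStop = new CompletableFuture<>();
        backend.requestRemoveExecutor("2", afterStop);
        afterStop.completeExceptionally(
                new IllegalStateException("Could not find CoarseGrainedScheduler."));

        return backend.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The stopped check runs only on the failure path, so successful remove executor requests pay no extra cost, which is the trade-off behind preferring option 2) over option 1).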
    
    
    ## How was this patch tested?
    I verified it manually; I no longer see these exceptions with the PR changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/devaraj-kavali/spark SPARK-14228

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19741.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19741
    
----
commit 4184efa17eb3959168caedc43b4058c0bed92083
Author: Devaraj K <[email protected]>
Date:   2017-11-14T02:41:37Z

    [SPARK-14228][CORE][YARN] Lost executor of RPC disassociated, and occurs
    exception: Could not find CoarseGrainedScheduler or it has been stopped

----

