[
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saisai Shao updated SPARK-13642:
--------------------------------
Summary: Different finishing state between Spark and YARN (was:
Inconsistent finishing state between driver and AM)
> Different finishing state between Spark and YARN
> ------------------------------------------------
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.0
> Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default
> application final state is "SUCCEEDED"; if an exception occurs, the final
> state is changed to "FAILED", which triggers a reattempt if one is still
> possible.
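> A hedged sketch of that logic (simplified and with illustrative names only,
> not the actual ApplicationMaster code):
> {code}
> import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
>
> // Simplified illustration: the final status is SUCCEEDED unless running the
> // user class throws, and only FAILED lets YARN schedule a reattempt.
> def finalStatusFor(runUserClass: () => Unit): FinalApplicationStatus =
>   try {
>     runUserClass()
>     FinalApplicationStatus.SUCCEEDED
>   } catch {
>     case _: Throwable => FinalApplicationStatus.FAILED
>   }
> {code}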
> This is fine in the normal case, but there is a race condition when the AM
> receives a signal (SIGTERM) while no exception has occurred. In that
> situation the shutdown hook is invoked and marks the application as
> finished successfully, and there is no further attempt.
> In this situation, from Spark's point of view the application has actually
> failed and needs another attempt, but from YARN's point of view it finished
> successfully.
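> As a hedged illustration of the race (again with hypothetical names and a
> stubbed-out unregister call), a shutdown hook like the one below reports
> whatever the current status is, which is still SUCCEEDED when no exception
> has been thrown yet:
> {code}
> import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
>
> @volatile var finalStatus = FinalApplicationStatus.SUCCEEDED  // the default
>
> // Stub standing in for the real unregister call, for illustration only.
> def unregisterWithRM(status: FinalApplicationStatus): Unit = ()
>
> // On SIGTERM the JVM runs this hook; if the user class has not failed yet,
> // finalStatus is still SUCCEEDED, so the AM unregisters with success and
> // YARN never schedules another attempt.
> sys.addShutdownHook {
>   unregisterWithRM(finalStatus)
> }
> {code}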
> This can happen when an NM fails: the NM failure sends SIGTERM to the AM,
> and the AM should mark this attempt as failed and run again, not invoke
> unregister.
> To increase the chance of hitting this race condition, here is the code to
> reproduce it:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
>
> val sc = new SparkContext(new SparkConf())
> // Window in which the AM can receive SIGTERM before any exception is thrown.
> Thread.sleep(30000L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed while sleeping, no exception is thrown, so from YARN's
> point of view the application finished successfully, but from Spark's point
> of view it should be reattempted.
> The log typically looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy :
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler:
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/stages/stage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/stages/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/stages,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/jobs/job/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/jobs/job,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/jobs/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
> o.e.j.s.ServletContextHandler{/jobs,null}
> 16/03/03 12:44:39 INFO SparkUI: Stopped Spark web UI at
> http://192.168.0.105:57452
> 16/03/03 12:44:39 INFO YarnAllocator: Driver requested a total number of 0
> executor(s).
> 16/03/03 12:44:39 INFO YarnAllocator: Canceling requests for 0 executor
> containers
> 16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Shutting down all
> executors
> 16/03/03 12:44:39 WARN YarnAllocator: Expected to find pending requests, but
> found none.
> 16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Asking each executor to
> shut down
> 16/03/03 12:44:39 INFO SchedulerExtensionServices: Stopping
> SchedulerExtensionServices
> (serviceOption=None,
> services=List(),
> started=false)
> 16/03/03 12:44:39 INFO MapOutputTrackerMasterEndpoint:
> MapOutputTrackerMasterEndpoint stopped!
> 16/03/03 12:44:39 INFO MemoryStore: MemoryStore cleared
> 16/03/03 12:44:39 INFO BlockManager: BlockManager stopped
> 16/03/03 12:44:39 INFO BlockManagerMaster: BlockManagerMaster stopped
> 16/03/03 12:44:39 INFO
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 16/03/03 12:44:39 INFO SparkContext: Successfully stopped SparkContext
> 16/03/03 12:44:39 INFO ApplicationMaster: Final app status: SUCCEEDED,
> exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 16/03/03 12:44:39 INFO ApplicationMaster: Unregistering ApplicationMaster
> with SUCCEEDED (diag message: Shutdown hook called before final status was
> reported.)
> 16/03/03 12:44:39 INFO AMRMClientImpl: Waiting for application to be
> successfully unregistered.
> 16/03/03 12:44:39 INFO ApplicationMaster: Deleting staging directory
> .sparkStaging/application_1456975940304_0004
> 16/03/03 12:44:40 INFO ShutdownHookManager: Shutdown hook called
> {noformat}
> So basically, I think we should only mark the application as "SUCCEEDED"
> after the user class has finished; otherwise, especially in the
> signal-stopped scenario, it would be better to mark the attempt as failed
> and try again (except when the application is explicitly killed by YARN).
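> A hedged sketch of the proposed behaviour (illustrative names only, not a
> patch against the actual ApplicationMaster code):
> {code}
> import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
>
> object AmFinishSketch {
>   @volatile private var userClassFinished = false
>
>   def runUserClass(userMain: () => Unit): Unit = {
>     userMain()                  // run the user's main(); may throw
>     userClassFinished = true    // only reached if the user class completed
>   }
>
>   // Called from the SIGTERM handler / shutdown hook.
>   def finalStatusOnShutdown(): FinalApplicationStatus =
>     if (userClassFinished) {
>       FinalApplicationStatus.SUCCEEDED  // safe to unregister with success
>     } else {
>       // Stopped before the user class finished: report FAILED and skip
>       // unregistering, so that YARN schedules another attempt (unless
>       // YARN itself killed the application).
>       FinalApplicationStatus.FAILED
>     }
> }
> {code}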