[ https://issues.apache.org/jira/browse/SPARK-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-15937.
-------------------------------
Resolution: Not A Problem
Per JIRA discussion
> Spark declares a successful job failed in yarn-cluster mode if the job
> takes very little time (~ < 10 seconds) to finish
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-15937
> URL: https://issues.apache.org/jira/browse/SPARK-15937
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.6.1
> Reporter: Subroto Sanyal
> Priority: Critical
>
> h5. Problem:
> A Spark job fails in yarn-cluster mode if it takes less than 10 seconds to
> complete. The job execution itself is successful, but the Spark framework
> declares it failed.
> {noformat}
> 16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for
> [TERM, HUP, INT]
> 16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId:
> appattempt_1465791692084_0078_000001
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(subroto); users
> with modify permissions: Set(subroto)
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application
> in a separate Thread
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization ...
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin
> registry on cluster...
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of
> US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property
> das.big-decimal.precision=32
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property
> das.default-timezone=US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property
> das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of
> from US/Eastern to US/Eastern
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit
> Server VM, 1.7 (Oracle Corporation)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m
> -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp
>
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001
> -XX:MaxPermSize=256m
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:
> 'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties'
> (default classpath)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before
> Plugin Registry start-up: 847.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
> 16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task
> configuration...
> 16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop
> task to DEBUG:
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Processing
> JobInput{_jobName=Import job (76):
> BookorderHS2ImportJob_SparkCluster#import(Identity)}
> 16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to
> hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient:
> /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827:
> masked=rw-r--r--
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
> sending #2
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
> value #2
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize:
> src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827,
> chunkSize=516, chunksPerPacket=127, packetSize=65532
> 16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for
> [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new
> packet seqno=0,
> src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827,
> packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
> sending #3
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
> value #3
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode
> 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block
> BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet
> packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock:
> 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS
> status: SUCCESS downstreamAckTimeNanos: 510884
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block
> BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet
> packet seqno:1 offsetInBlock:330 lastPacketInBlock:true
> lastByteOffsetInBlock: 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS
> status: SUCCESS downstreamAckTimeNanos: 1449877
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
> sending #4
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
> value #4
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
> 16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED,
> exitCode: 0
> 16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
> 16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not
> initialize after waiting for 500000 ms. Please check earlier log output for
> errors. Failing the application.
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering
> ApplicationMaster with SUCCEEDED
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory
> .sparkStaging/application_1465791692084_0078
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
> sending #5
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
> value #5
> 16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
> 16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
> 16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto:
> closed
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto:
> stopped, remaining connections 0
> {noformat}
> h5. Root Cause
> The race condition lies in
> _org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_:
> its check for whether the SparkContext is available is not sufficient,
> because it does not account for a job that finishes within 10 seconds.
> From the Application Master log above we can see that the user class finishes
> executing and is marked as finished with app status _SUCCEEDED_ and exitCode
> _0_, but immediately afterwards an error reports that the SparkContext wasn't
> initialised after 500 seconds (the job here actually takes a little less than
> 5 seconds to complete).
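>
> For illustration, the reported sequence can be reproduced outside YARN with a
> minimal sketch of the wait loop, assuming a simplified model of the Spark 1.6
> Application Master (the names, poll interval, and shortened deadline below
> are illustrative assumptions, not the actual Spark source):
> {noformat}
> import java.util.concurrent.atomic.AtomicReference
>
> object AmWaitRaceSketch {
>
>   // Stand-in for the Application Master's SparkContext reference. It is set
>   // only if the user class actually creates a SparkContext; here it never is.
>   private val contextRef = new AtomicReference[AnyRef](null)
>
>   def main(args: Array[String]): Unit = {
>     // "User class" thread: succeeds in ~2 seconds without ever publishing
>     // a SparkContext, mirroring the log above.
>     val userThread = new Thread(new Runnable {
>       override def run(): Unit = {
>         Thread.sleep(2000L)
>         println("user class done: final app status SUCCEEDED, exitCode 0")
>       }
>     })
>     userThread.start()
>
>     // AM-side wait loop: polls for the context until a deadline
>     // (spark.yarn.am.waitTime governs the real timeout; shortened here).
>     val deadline = System.currentTimeMillis() + 10000L
>     while (contextRef.get() == null && System.currentTimeMillis() < deadline) {
>       println("Waiting for spark context initialization ...")
>       Thread.sleep(1000L)
>     }
>
>     userThread.join()
>     if (contextRef.get() == null) {
>       // Fires even though the user class already finished successfully,
>       // which is the behaviour described above.
>       println("SparkContext did not initialize; failing the application")
>     }
>   }
> }
> {noformat}
> Running the sketch prints the SUCCEEDED line first and the failure message
> afterwards, matching the order of events in the Application Master log.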