Subroto Sanyal created SPARK-15937:
--------------------------------------
Summary: Spark declares a succeeded job to be failed in
yarn-cluster mode if the job takes very little time (~ < 10 seconds) to finish
Key: SPARK-15937
URL: https://issues.apache.org/jira/browse/SPARK-15937
Project: Spark
Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Subroto Sanyal
h5. Problem:
A Spark job fails in yarn-cluster mode if it takes less than 10 seconds to
run. The job execution itself is successful, but the Spark framework declares
it failed.
{noformat}
16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for
[TERM, HUP, INT]
16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId:
appattempt_1465791692084_0078_000001
16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(subroto); users
with modify permissions: Set(subroto)
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in
a separate Thread
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context
initialization
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context
initialization ...
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry
on cluster...
16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of
US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property
das.big-decimal.precision=32
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property
das.default-timezone=US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property
das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from
US/Eastern to US/Eastern
16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit
Server VM, 1.7 (Oracle Corporation)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m
-Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp
-Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001
-XX:MaxPermSize=256m
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:
'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties'
(default classpath)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before
Plugin Registry start-up: 847.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task
configuration...
16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop
task to DEBUG:
16/06/13 10:50:35 INFO cluster.JobProcessor: Processing
JobInput{_jobName=Import job (76):
BookorderHS2ImportJob_SparkCluster#import(Identity)}
16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to
hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
16/06/13 10:50:35 DEBUG hdfs.DFSClient:
/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827:
masked=rw-r--r--
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
sending #2
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
value #2
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize:
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827,
chunkSize=516, chunksPerPacket=127, packetSize=65532
16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for
[DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new
packet seqno=0,
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827,
packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
sending #3
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
value #3
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode
10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block
BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet
packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock:
330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS
status: SUCCESS downstreamAckTimeNanos: 510884
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block
BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet
packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock:
330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS
status: SUCCESS downstreamAckTimeNanos: 1449877
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
sending #4
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
value #4
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED,
exitCode: 0
16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize
after waiting for 500000 ms. Please check earlier log output for errors.
Failing the application.
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster
with SUCCEEDED
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory
.sparkStaging/application_1465791692084_0078
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto
sending #5
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got
value #5
16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto:
closed
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto:
stopped, remaining connections 0
{noformat}
h5. Root Cause
There is a race condition in
_org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_:
the logic used to determine whether the SparkContext is available is not
sufficient, because it does not account for a job that finishes within the
10-second wait window.
From the above log snippet (log of the App Master) we can see that the user
class finishes executing and is marked as finished with app status _SUCCEEDED_
and exitCode _0_, but immediately after that there is an error stating that
the SparkContext wasn't initialized after 500 seconds (the job here actually
takes a little less than 5 seconds to complete).
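The race described above can be sketched in miniature (this is an illustrative model, not Spark's actual code): one thread stands in for the user class, another for the ApplicationMaster thread waiting on the SparkContext reference. When the user class completes before a context is ever observed, the waiter wakes up, sees no context, and wrongly reports failure even though the job succeeded. The names `context_ref`, `finished`, and `wait_for_context` are hypothetical stand-ins for `sparkContextRef` and `waitForSparkContextInitialized()`.

```python
import threading

context_ref = {"sc": None}   # stands in for the AM's sparkContextRef
finished = {"done": False}   # stands in for the "user class done" flag
cond = threading.Condition()

def run_user_class():
    # A very short job: it completes successfully almost immediately,
    # before the waiter ever sees an initialized SparkContext.
    with cond:
        finished["done"] = True   # job succeeded...
        cond.notify_all()         # ...and wakes the waiting AM thread

def wait_for_context(timeout=1.0):
    # Models waitForSparkContextInitialized(): wake up when notified,
    # then check only whether the context reference was set.
    with cond:
        cond.wait_for(
            lambda: context_ref["sc"] is not None or finished["done"],
            timeout=timeout,
        )
        if context_ref["sc"] is None:
            # Bug: this branch fires even when the user class finished
            # successfully, because only context presence is checked,
            # not whether the application already completed.
            return "FAILED: SparkContext did not initialize"
        return "OK"

t = threading.Thread(target=run_user_class)
t.start()
result = wait_for_context()
t.join()
print(result)
```

In this model, as in the log above, the "failure" is reported solely because the success path never set the context reference before the wait ended, which matches the behavior seen in the ApplicationMaster log.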
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]