[ 
https://issues.apache.org/jira/browse/SPARK-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subroto Sanyal updated SPARK-15937:
-----------------------------------
    Description: 
h5. Problem:
Spark Job fails in yarn-cluster mode if the job takes less time than 10 
seconds. The job execution here is successful but, spark framework declares it 
failed.
{noformat}
16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1465791692084_0078_000001
16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(subroto); users 
with modify permissions: Set(subroto)
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry 
on cluster...
16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of 
US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.big-decimal.precision=32
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.default-timezone=US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from 
US/Eastern to US/Eastern
16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit 
Server VM, 1.7 (Oracle Corporation)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m 
-Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp
 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001
 -XX:MaxPermSize=256m 
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  
'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties'
 (default classpath)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before 
Plugin Registry start-up: 847.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task 
configuration...
16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop 
task to DEBUG: 
16/06/13 10:50:35 INFO cluster.JobProcessor: Processing 
JobInput{_jobName=Import job (76): 
BookorderHS2ImportJob_SparkCluster#import(Identity)}
16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to 
hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
16/06/13 10:50:35 DEBUG hdfs.DFSClient: 
/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: 
masked=rw-r--r--
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #2
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #2
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: 
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
chunkSize=516, chunksPerPacket=127, packetSize=65532
16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for 
[DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new 
packet seqno=0, 
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #3
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #3
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 
10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 
330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS 
status: SUCCESS downstreamAckTimeNanos: 510884
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock: 
330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS 
status: SUCCESS downstreamAckTimeNanos: 1449877
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #4
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #4
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize 
after waiting for 500000 ms. Please check earlier log output for errors. 
Failing the application.
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with SUCCEEDED
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory 
.sparkStaging/application_1465791692084_0078
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #5
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #5
16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
closed
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
stopped, remaining connections 0
{noformat}

h5. Root Cause
Race condition in the logic of method  
_org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_
 to figure out if the SparkContext is available or not is not sufficient and it 
doesn't respect the fact job can be finished within 10 seconds.
>From the above log snippet (log of App Master) we can see that user class is 
>done executing and marks as finished with App status as _SUCCEEDED_ and 
>exitCode as _0_ but, immediately after that there is an error mentioning that 
>SparkContext was't initialised after 500 seconds(the job here actually takes 
>little less than 5 seconds to complete)

  was:
h5. Problem:
Spark Job fails in yarn-cluster mode if the job takes less time than 10 
seconds. The job execution here is successful but, spark framework declares it 
failed.
{noformat}
16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1465791692084_0078_000001
16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(subroto); users 
with modify permissions: Set(subroto)
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry 
on cluster...
16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of 
US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.big-decimal.precision=32
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.default-timezone=US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from 
US/Eastern to US/Eastern
16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit 
Server VM, 1.7 (Oracle Corporation)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m 
-Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp
 
-Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001
 -XX:MaxPermSize=256m 
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  
'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties'
 (default classpath)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before 
Plugin Registry start-up: 847.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task 
configuration...
16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop 
task to DEBUG: 
16/06/13 10:50:35 INFO cluster.JobProcessor: Processing 
JobInput{_jobName=Import job (76): 
BookorderHS2ImportJob_SparkCluster#import(Identity)}
16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to 
hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
16/06/13 10:50:35 DEBUG hdfs.DFSClient: 
/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: 
masked=rw-r--r--
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #2
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #2
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: 
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
chunkSize=516, chunksPerPacket=127, packetSize=65532
16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for 
[DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new 
packet seqno=0, 
src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #3
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #3
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 
10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 
330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS 
status: SUCCESS downstreamAckTimeNanos: 510884
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock: 
330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS 
status: SUCCESS downstreamAckTimeNanos: 1449877
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #4
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #4
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize 
after waiting for 500000 ms. Please check earlier log output for errors. 
Failing the application.
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with SUCCEEDED
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory 
.sparkStaging/application_1465791692084_0078
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
sending #5
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
value #5
16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
closed
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
stopped, remaining connections 0
{noformat}

h5. Root Cause
Race condition in the logic of method  
_org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_
 to figure out if the SparkContext is available or not is not sufficient and it 
doesn't respect the fact job can be finished within 10 seconds.
>From the above log snippet (log of App Master) we can see that user class is 
>done executing and marks as finished with Spp status as _SUCCEEDED_ and 
>exitCode as _0_ but, immediately after that there is an error mentioning that 
>SparkContext was't initialised after 500 seconds(the job here actually takes 
>little less than 5 seconds to complete)


> Spark declares a succeeding job to be failed in yarn-cluster mode if the job 
> takes very small time (~ < 10 seconds) to finish
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15937
>                 URL: https://issues.apache.org/jira/browse/SPARK-15937
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Subroto Sanyal
>            Priority: Critical
>
> h5. Problem:
> Spark Job fails in yarn-cluster mode if the job takes less time than 10 
> seconds. The job execution here is successful but, spark framework declares 
> it failed.
> {noformat}
> 16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for 
> [TERM, HUP, INT]
> 16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
> appattempt_1465791692084_0078_000001
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(subroto); users 
> with modify permissions: Set(subroto)
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application 
> in a separate Thread
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ... 
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin 
> registry on cluster...
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of 
> US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
> das.big-decimal.precision=32
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
> das.default-timezone=US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property 
> das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of 
> from US/Eastern to US/Eastern
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit 
> Server VM, 1.7 (Oracle Corporation)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m 
> -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp
>  
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001
>  -XX:MaxPermSize=256m 
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  
> 'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties'
>  (default classpath)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before 
> Plugin Registry start-up: 847.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
> 16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task 
> configuration...
> 16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop 
> task to DEBUG: 
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Processing 
> JobInput{_jobName=Import job (76): 
> BookorderHS2ImportJob_SparkCluster#import(Identity)}
> 16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to 
> hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: 
> /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: 
> masked=rw-r--r--
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #2
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #2
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: 
> src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
> chunkSize=516, chunksPerPacket=127, packetSize=65532
> 16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for 
> [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new 
> packet seqno=0, 
> src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, 
> packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #3
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #3
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 
> 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
> BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
> packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 
> 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS 
> status: SUCCESS downstreamAckTimeNanos: 510884
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block 
> BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet 
> packet seqno:1 offsetInBlock:330 lastPacketInBlock:true 
> lastByteOffsetInBlock: 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS 
> status: SUCCESS downstreamAckTimeNanos: 1449877
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #4
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #4
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
> 16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
> 16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not 
> initialize after waiting for 500000 ms. Please check earlier log output for 
> errors. Failing the application.
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with SUCCEEDED
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1465791692084_0078
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto 
> sending #5
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got 
> value #5
> 16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
> 16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
> 16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
> closed
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to 
> ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: 
> stopped, remaining connections 0
> {noformat}
> h5. Root Cause
> Race condition in the logic of method  
> _org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_
>  to figure out if the SparkContext is available or not is not sufficient and 
> it doesn't respect the fact job can be finished within 10 seconds.
> From the above log snippet (log of App Master) we can see that user class is 
> done executing and marks as finished with App status as _SUCCEEDED_ and 
> exitCode as _0_ but, immediately after that there is an error mentioning that 
> SparkContext was't initialised after 500 seconds(the job here actually takes 
> little less than 5 seconds to complete)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to