[ https://issues.apache.org/jira/browse/SPARK-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liyin Tang updated SPARK-16244:
-------------------------------
Description:

This streaming job has a very simple DAG. Each batch has only 1 job, and each job has only 1 stage.

Based on the following logs, we observed a potential race condition. Stage 1 failed due to some task failures, which triggers failJobAndIndependentStages. Meanwhile, the next stage (job), 2, was submitted and was able to run a few tasks successfully before the JobGenerator was stopped via the shutdown hook. Because that next job ran a few tasks through successfully, it corrupted all the checkpoint / offset management.

Here is the log from my job:

{color:red}
Stage 227 started:
{color}
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 227(foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55), which has no missing parents
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55)
{color:red}
Stage 227 failed:
{color}
[ERROR] 2016-06-25 19:01:34,083 org.apache.spark.scheduler.TaskSetManager logError - Task 26 in stage 227.0 failed 4 times; aborting job
[INFO] 2016-06-25 19:01:34,086 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Cancelling stage 227
[INFO] 2016-06-25 19:01:34,088 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Stage 227 was cancelled
[INFO] 2016-06-25 19:01:34,089 org.apache.spark.scheduler.DAGScheduler logInfo - ResultStage 227 (foreachRDD at DBExportStreaming.java:55) failed in 153.914 s
[INFO] 2016-06-25 19:01:34,090 org.apache.spark.scheduler.DAGScheduler logInfo - Job 227 failed: foreachRDD at DBExportStreaming.java:55, took 153.930462 s
[INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Finished job streaming job 1466881140000 ms.0 from job set of time 1466881140000 ms
[INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Total delay: 154.091 s for time 1466881140000 ms (execution: 153.935 s)
{color:red}
Stage 228 started:
{color}
[INFO] 2016-06-25 19:01:34,094 org.apache.spark.SparkContext logInfo - Starting job: foreachRDD at DBExportStreaming.java:55
[INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 228 (foreachRDD at DBExportStreaming.java:55) with 1495 output partitions
[INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 228(foreachRDD at DBExportStreaming.java:55)
Exception in thread "main" [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List()
{color:red}
Shutdown hook was called after stage 228 started:
{color}
[INFO] 2016-06-25 19:01:34,099 org.apache.spark.streaming.StreamingContext logInfo - Invoking stop(stopGracefully=false) from shutdown hook
[INFO] 2016-06-25 19:01:34,101 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopping JobGenerator immediately
[INFO] 2016-06-25 19:01:34,102 org.apache.spark.streaming.util.RecurringTimer logInfo - Stopped timer for JobGenerator after time 1466881260000
[INFO] 2016-06-25 19:01:34,103 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopped JobGenerator
[INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(133720) called with curMem=344903, maxMem=1159641169
[INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - Block broadcast_229 stored as values in memory (estimated size 130.6 KB, free 1105.5 MB)
[INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(51478) called with curMem=478623, maxMem=1159641169
[INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - Block broadcast_229_piece0 stored as bytes in memory (estimated size 50.3 KB, free 1105.4 MB)
[INFO] 2016-06-25 19:01:34,108 org.apache.spark.storage.BlockManagerInfo logInfo - Added broadcast_229_piece0 in memory on 10.123.209.8:42154 (size: 50.3 KB, free: 1105.8 MB)
[INFO] 2016-06-25 19:01:34,109 org.apache.spark.SparkContext logInfo - Created broadcast 229 from broadcast at DAGScheduler.scala:861
[INFO] 2016-06-25 19:01:34,110 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 228 (MapPartitionsRDD[458] at foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 19:01:34,111 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Adding task set 228.0 with 1495 tasks
[INFO] 2016-06-25 19:01:34,329 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 0.0 in stage 228.0 (TID 340194, i-87adb702.inst.aws.airbnb.com, RACK_LOCAL, 2894 bytes)
{color:red}
Stage 228 has completed a few tasks even though the job is supposed to be killed:
{color}
[INFO] 2016-06-25 19:01:34,591 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 2.0 in stage 228.0 (TID 340196, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2872 bytes)
[INFO] 2016-06-25 19:01:34,591 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 1.0 in stage 228.0 (TID 340195) in 100 ms on i-fbc19629.inst.aws.airbnb.com (1/1495)
[INFO] 2016-06-25 19:01:34,610 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 3.0 in stage 228.0 (TID 340197, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2872 bytes)
[INFO] 2016-06-25 19:01:34,610 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 2.0 in stage 228.0 (TID 340196) in 19 ms on i-fbc19629.inst.aws.airbnb.com (2/1495)
[INFO] 2016-06-25 19:01:34,654 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 4.0 in stage 228.0 (TID 340198, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2871 bytes)
[INFO] 2016-06-25 19:01:34,654 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 3.0 in stage 228.0 (TID 340197) in 44 ms on i-fbc19629.inst.aws.airbnb.com (3/1495)

was:
This streaming job has a very simple DAG. Each batch has only 1 job, and each job has only 1 stage. Based on the following logs, we observed a potential race condition. Stage 1 failed due to some task failures, which triggers failJobAndIndependentStages. Meanwhile, the next stage (job), 2, was submitted and was able to run a few tasks successfully before the JobGenerator was stopped via the shutdown hook. Because that next job ran a few tasks through successfully, it corrupted all the checkpoint / offset management. I will attach the log in the jira as well.


> Failed job/stage couldn't stop JobGenerator immediately.
> --------------------------------------------------------
>
> Key: SPARK-16244
> URL: https://issues.apache.org/jira/browse/SPARK-16244
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.5.2
> Reporter: Liyin Tang
>
> This streaming job has a very simple DAG. Each batch has only 1 job, and
> each job has only 1 stage.
> Based on the following logs, we observed a potential race condition. Stage 1
> failed due to some task failures, which triggers failJobAndIndependentStages.
> Meanwhile, the next stage (job), 2, was submitted and was able to run a few
> tasks successfully before the JobGenerator was stopped via the shutdown hook.
> Because that next job ran a few tasks through successfully, it corrupted all
> the checkpoint / offset management.
-- This message was
sent by Atlassian JIRA (v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
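For readers less familiar with the scheduler internals, the suspected interleaving can be modeled with two plain threads. This is only an illustrative sketch, not Spark code: the class name, event strings, and the use of latches are all hypothetical. The latches simply force the ordering that the timestamps above show actually occurred, namely that stop() only took effect after stage 228 had already finished some tasks, because nothing gates the next batch's submission on the previous job's failure being fully handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified model of the interleaving seen in the logs above.
public class RaceSketch {

    // Runs the two actors and returns the order in which events occurred.
    public static List<String> run() throws InterruptedException {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch jobFailed = new CountDownLatch(1);
        CountDownLatch tasksOfNextJobRan = new CountDownLatch(1);

        // Actor 1: the shutdown path. It reacts to the failure, but (as the
        // timestamps show) the generator is only stopped after job 228 has
        // already started and finished tasks.
        Thread shutdownHook = new Thread(() -> {
            try {
                jobFailed.await();
                tasksOfNextJobRan.await();
                events.add("JobGenerator stopped");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Actor 2: the batch scheduler. It submits the next job as soon as
        // the previous one finishes -- successfully or not -- with no check
        // that a failure-triggered shutdown is in flight.
        Thread scheduler = new Thread(() -> {
            events.add("job 227 failed");
            jobFailed.countDown();
            events.add("job 228 submitted");
            events.add("task in stage 228.0 finished");
            tasksOfNextJobRan.countDown();
        });

        shutdownHook.start();
        scheduler.start();
        scheduler.join();
        shutdownHook.join();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        run().forEach(System.out::println);
    }
}
```

In this model the fix would be to make the scheduler consult a "stopping" flag (set synchronously when the job fails) before submitting the next job; whether that is the right place to close the window in the real JobScheduler/JobGenerator is exactly what this ticket raises.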