[ https://issues.apache.org/jira/browse/SPARK-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liyin Tang updated SPARK-16244:
-------------------------------
Description:

This streaming job has a very simple DAG. Each batch has only 1 job, and each job has only 1 stage.

Based on the following logs, we observed a potential race condition. Stage 1 failed due to some task failures, which triggers failJobAndIndependentStages. Meanwhile, the next stage (job), 2, was submitted and was able to run a few tasks successfully before the JobGenerator was stopped via the shutdown hook. Because that next job ran a few tasks through successfully, it corrupted all the checkpoint / offset management.

Here is the log from my job:

{color:red}
Stage 227 started:
{color}
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 227(foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55), which has no missing parents
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55)
{color:red}
Stage 227 failed:
{color}
[ERROR] 2016-06-25 19:01:34,083 org.apache.spark.scheduler.TaskSetManager logError - Task 26 in stage 227.0 failed 4 times; aborting job
[INFO] 2016-06-25 19:01:34,086 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Cancelling stage 227
[INFO] 2016-06-25 19:01:34,088 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Stage 227 was cancelled
[INFO] 2016-06-25 19:01:34,089 org.apache.spark.scheduler.DAGScheduler logInfo - ResultStage 227 (foreachRDD at DBExportStreaming.java:55) failed in 153.914 s
[INFO] 2016-06-25 19:01:34,090 org.apache.spark.scheduler.DAGScheduler logInfo - Job 227 failed: foreachRDD at DBExportStreaming.java:55, took 153.930462 s
[INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Finished job streaming job 1466881140000 ms.0 from job set of time 1466881140000 ms
[INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Total delay: 154.091 s for time 1466881140000 ms (execution: 153.935 s)
{color:red}
Stage 228 started:
{color}
[INFO] 2016-06-25 19:01:34,094 org.apache.spark.SparkContext logInfo - Starting job: foreachRDD at DBExportStreaming.java:55
[INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 228 (foreachRDD at DBExportStreaming.java:55) with 1495 output partitions
[INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 228(foreachRDD at DBExportStreaming.java:55)
Exception in thread "main" [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List()
{color:red}
Shutdown hook was called after stage 228 started:
{color}
[INFO] 2016-06-25 19:01:34,099 org.apache.spark.streaming.StreamingContext logInfo - Invoking stop(stopGracefully=false) from shutdown hook
[INFO] 2016-06-25 19:01:34,101 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopping JobGenerator immediately
[INFO] 2016-06-25 19:01:34,102 org.apache.spark.streaming.util.RecurringTimer logInfo - Stopped timer for JobGenerator after time 1466881260000
[INFO] 2016-06-25 19:01:34,103 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopped JobGenerator
[INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(133720) called with curMem=344903, maxMem=1159641169
[INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - Block broadcast_229 stored as values in memory (estimated size 130.6 KB, free 1105.5 MB)
[INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(51478) called with curMem=478623, maxMem=1159641169
[INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - Block broadcast_229_piece0 stored as bytes in memory (estimated size 50.3 KB, free 1105.4 MB)
[INFO] 2016-06-25 19:01:34,108 org.apache.spark.storage.BlockManagerInfo logInfo - Added broadcast_229_piece0 in memory on 10.123.209.8:42154 (size: 50.3 KB, free: 1105.8 MB)
[INFO] 2016-06-25 19:01:34,109 org.apache.spark.SparkContext logInfo - Created broadcast 229 from broadcast at DAGScheduler.scala:861
[INFO] 2016-06-25 19:01:34,110 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 228 (MapPartitionsRDD[458] at foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 19:01:34,111 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Adding task set 228.0 with 1495 tasks
[INFO] 2016-06-25 19:01:34,329 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 0.0 in stage 228.0 (TID 340194, i-87adb702.inst.aws.airbnb.com, RACK_LOCAL, 2894 bytes)
{color:red}
Stage 228 has completed a few tasks even though the job is supposed to be killed:
{color}
[INFO] 2016-06-25 19:01:34,591 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 2.0 in stage 228.0 (TID 340196, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2872 bytes)
[INFO] 2016-06-25 19:01:34,591 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 1.0 in stage 228.0 (TID 340195) in 100 ms on i-fbc19629.inst.aws.airbnb.com (1/1495)
[INFO] 2016-06-25 19:01:34,610 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 3.0 in stage 228.0 (TID 340197, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2872 bytes)
[INFO] 2016-06-25 19:01:34,610 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 2.0 in stage 228.0 (TID 340196) in 19 ms on i-fbc19629.inst.aws.airbnb.com (2/1495)
[INFO] 2016-06-25 19:01:34,654 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 4.0 in stage 228.0 (TID 340198, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2871 bytes)
[INFO] 2016-06-25 19:01:34,654 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 3.0 in stage 228.0 (TID 340197) in 44 ms on i-fbc19629.inst.aws.airbnb.com (3/1495)

was:
This streaming job has a very simple DAG. Each batch has only 1 job, and each job has only 1 stage. Based on the following logs, we observed a potential race condition. Stage 1 failed due to some task failures, which triggers failJobAndIndependentStages. Meanwhile, the next stage (job), 2, was submitted and was able to run a few tasks successfully before the JobGenerator was stopped via the shutdown hook. Because that next job ran a few tasks through successfully, it corrupted all the checkpoint / offset management. I will attach the log in the jira as well.


> Failed job/stage couldn't stop JobGenerator immediately.
> --------------------------------------------------------
>
> Key: SPARK-16244
> URL: https://issues.apache.org/jira/browse/SPARK-16244
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.5.2
> Reporter: Liyin Tang
>
> This streaming job has a very simple DAG. Each batch has only 1 job, and
> each job has only 1 stage.
> Based on the following logs, we observed a potential race condition. Stage 1
> failed due to some task failures, which triggers failJobAndIndependentStages.
> Meanwhile, the next stage (job), 2, was submitted and was able to run a few
> tasks successfully before the JobGenerator was stopped via the shutdown hook.
> Because that next job ran a few tasks through successfully, it corrupted all
> the checkpoint / offset management.
-- This message was
sent by Atlassian JIRA (v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
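For readers less familiar with the scheduler internals, the suspected interleaving can be modeled with two plain threads. This is only an illustrative sketch, not Spark code: the class name, event strings, and the use of latches are all hypothetical. The latches simply force the ordering that the timestamps above show actually occurred, namely that stop() only took effect after stage 228 had already finished some tasks, because nothing gates the next batch's submission on the previous job's failure being fully handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified model of the interleaving seen in the logs above.
public class RaceSketch {

    // Runs the two actors and returns the order in which events occurred.
    public static List<String> run() throws InterruptedException {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch jobFailed = new CountDownLatch(1);
        CountDownLatch tasksOfNextJobRan = new CountDownLatch(1);

        // Actor 1: the shutdown path. It reacts to the failure, but (as the
        // timestamps show) the generator is only stopped after job 228 has
        // already started and finished tasks.
        Thread shutdownHook = new Thread(() -> {
            try {
                jobFailed.await();
                tasksOfNextJobRan.await();
                events.add("JobGenerator stopped");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Actor 2: the batch scheduler. It submits the next job as soon as
        // the previous one finishes -- successfully or not -- with no check
        // that a failure-triggered shutdown is in flight.
        Thread scheduler = new Thread(() -> {
            events.add("job 227 failed");
            jobFailed.countDown();
            events.add("job 228 submitted");
            events.add("task in stage 228.0 finished");
            tasksOfNextJobRan.countDown();
        });

        shutdownHook.start();
        scheduler.start();
        scheduler.join();
        shutdownHook.join();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        run().forEach(System.out::println);
    }
}
```

In this model the fix would be to make the scheduler consult a "stopping" flag (set synchronously when the job fails) before submitting the next job; whether that is the right place to close the window in the real JobScheduler/JobGenerator is exactly what this ticket raises.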