[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Gesher updated SPARK-19764:
-------------------------------
    Description: 
We've come across a job that won't finish. Running on a six-node cluster, each 
of the executors ends up with 5-7 tasks that are never marked as completed.

Here's an excerpt from the web UI:

||Index||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read Size / Records||Errors||
|105|1131|0|SUCCESS|PROCESS_LOCAL|4 / 172.31.24.171|2017/02/27 22:51:36|1.9 min|9 ms|4 ms|0.7 s|2 ms|6 ms|384.1 MB|90.3 MB / 572| |
|106|1168|0|RUNNING|ANY|2 / 172.31.16.112|2017/02/27 22:53:25|6.5 h|0 ms|0 ms|1 s|0 ms|0 ms|384.1 MB|98.7 MB / 624| |

However, the Executor reports the task as finished: 
{noformat}
17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 2633558 bytes result sent via BlockManager)
{noformat}

As does the driver log:
{noformat}
17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 2633558 bytes result sent via BlockManager)
{noformat}

The full log from this executor and the {{stderr}} from 
{{app-20170227223614-0001/2/stderr}} are attached.
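
For anyone checking the attached executor log against the web UI, the comparison above can be sketched as a small script. This is only a rough sketch, not part of the original report: the function names ({{finished_tids}}, {{stuck_in_ui}}) are made up for illustration, the list of RUNNING task IDs is assumed to be copied by hand from the UI, and the regular expressions assume the log wording shown in the excerpts above.

```python
import re

# Patterns follow the Executor log lines quoted in this report;
# other Spark versions may word these messages differently (assumption).
RUNNING = re.compile(r"INFO Executor: Running task \S+ in stage \S+ \(TID (\d+)\)")
FINISHED = re.compile(r"INFO Executor: Finished task \S+ in stage \S+ \(TID (\d+)\)")


def finished_tids(log_lines):
    """Return (started, finished) sets of task IDs seen in executor log lines."""
    started, finished = set(), set()
    for line in log_lines:
        match = RUNNING.search(line)
        if match:
            started.add(int(match.group(1)))
            continue
        match = FINISHED.search(line)
        if match:
            finished.add(int(match.group(1)))
    return started, finished


def stuck_in_ui(log_lines, ui_running_tids):
    """Task IDs the UI lists as RUNNING although the executor logged them as finished."""
    _, finished = finished_tids(log_lines)
    return sorted(finished & set(ui_running_tids))


# The two log lines from the excerpt above, with TID 1168 taken from the UI table.
sample = [
    "17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)",
    "17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). "
    "2633558 bytes result sent via BlockManager)",
]
print(stuck_in_ui(sample, ui_running_tids=[1168]))  # prints [1168]
```

To run it over a real log, pass the lines of the executor's {{stderr}} file instead of {{sample}}.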


  was:
We've come across a job that won't finish. Running on a six-node cluster, each 
of the executors ends up with 5-7 tasks that are never marked as completed.

Here's an excerpt from the web UI:

||Index||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read Size / Records||Errors||
|105|1131|0|SUCCESS|PROCESS_LOCAL|4 / 172.31.24.171|2017/02/27 22:51:36|1.9 min|9 ms|4 ms|0.7 s|2 ms|6 ms|384.1 MB|90.3 MB / 572| |
|106|1168|0|RUNNING|ANY|2 / 172.31.16.112|2017/02/27 22:53:25|6.5 h|0 ms|0 ms|1 s|0 ms|0 ms|384.1 MB|98.7 MB / 624| |

However, the Executor reports the task as finished: 
{noformat}
17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 2633558 bytes result sent via BlockManager)
{noformat}


Full log from this executor attached.



> Executors hang with supposedly running tasks that are really finished.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-19764
>                 URL: https://issues.apache.org/jira/browse/SPARK-19764
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 2.0.2
>         Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>            Reporter: Ari Gesher
>         Attachments: driver-log-stderr.log, executor-2.log


