Tobias Bertelsen created SPARK-6307:
---------------------------------------

             Summary: Executors fetch the same RDD block hundreds or thousands of times
                 Key: SPARK-6307
                 URL: https://issues.apache.org/jira/browse/SPARK-6307
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2+
         Environment: Linux, Spark Standalone 2.10, running in a PBS grid engine
            Reporter: Tobias Bertelsen


The block manager kept fetching the same blocks over and over, making tasks 
with network activity extremely slow. Two identical tasks can take anywhere 
from 12 seconds to more than an hour (at which point I stopped the job).

Spark should cache the blocks so that it does not fetch the same ones over 
and over.

Here is a simplified version of the code that provokes it:

{code}
// Read a few thousand lines (~ 15 MB)
val fileContents = sc.newAPIHadoopFile(path, ......).repartition(16)
val data = fileContents.map{x => parseContent(x)}.cache()
// Do a pairwise comparison and count the best pairs
val pairs = data.cartesian(data).filter { case (x, y) =>
  similarity(x, y) > 0.9
}
pairs.count()
{code}
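
One possible reading of why the `cartesian` step above could request the same block so often is sketched below in plain Scala, without Spark. It assumes (this is not confirmed anywhere in this report) that the pairwise product is computed as a nested iteration that asks for the inner partition's iterator once per outer element. `CountingBlock`, `naivePairs` and `bufferedPairs` are hypothetical names used only for illustration; they are not Spark APIs.

```scala
object NestedFetchSketch {
  // Stand-in for a block that lives on another executor (hypothetical).
  // Every call to `iterator` counts as one "remote fetch".
  final class CountingBlock[A](data: Seq[A]) {
    var fetches = 0
    def iterator: Iterator[A] = { fetches += 1; data.iterator }
  }

  // Nested iteration: the inner iterator is requested again for each
  // element of the outer side, so fetches grow with the outer size.
  def naivePairs[A](outer: Seq[A], inner: CountingBlock[A]): Iterator[(A, A)] =
    for (x <- outer.iterator; y <- inner.iterator) yield (x, y)

  // Materialize the inner block once and reuse the local copy.
  def bufferedPairs[A](outer: Seq[A], inner: CountingBlock[A]): Iterator[(A, A)] = {
    val local = inner.iterator.toVector
    for (x <- outer.iterator; y <- local.iterator) yield (x, y)
  }
}
```

Under that assumption, the number of fetches in the naive variant equals the number of elements on the outer side, while the buffered variant fetches once per task. With a few hundred records per partition this would be in line with the hundreds or thousands of repeated fetches described here.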

This is a tiny fraction of the stderr from one of the workers:

{code}
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_1 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_0 remotely

Thousands more lines, fetching the same 16 remote blocks

15/03/12 22:25:44 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
{code}
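
To quantify the repetition in a log like the one above, the "Found block ... remotely" lines can be tallied per block id. The following is a minimal sketch in plain Scala; the object name `RemoteFetchTally` is hypothetical, and the pattern assumes the message format is exactly as in the excerpt.

```scala
object RemoteFetchTally {
  // Matches the BlockManager message format seen in the stderr excerpt.
  private val FoundRemotely =
    """.*INFO BlockManager: Found block (\S+) remotely\s*""".r

  // Count how many times each block id is reported as found remotely.
  // Lines that do not match the pattern are ignored.
  def tally(lines: Iterator[String]): Map[String, Int] =
    lines.foldLeft(Map.empty[String, Int]) { (counts, line) =>
      line match {
        case FoundRemotely(blockId) =>
          counts.updated(blockId, counts.getOrElse(blockId, 0) + 1)
        case _ => counts
      }
    }

  def main(args: Array[String]): Unit = {
    // Pass the path to a worker's stderr file as the first argument.
    val source = scala.io.Source.fromFile(args(0))
    try {
      tally(source.getLines()).toSeq
        .sortBy { case (_, n) => -n }
        .foreach { case (id, n) => println(s"$id\t$n") }
    } finally source.close()
  }
}
```

Comparing the per-block counts with the number of tasks run on that executor should show whether each block is fetched roughly once per task or far more often.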

h2. Details for that stage from the UI.

 - *Total task time across all tasks:* 11.9 h
 - *Input:* 2.2 GB
 - *Shuffle read:* 4.5 MB


h3. Summary Metrics for 176 Completed Tasks

|| Metric || Min || 25th percentile || Median || 75th percentile || Max ||
| Duration | 7 s | 8 s | 8 s | 12 s | 59 min |
| GC Time | 0 ms | 99 ms | 0.1 s | 0.2 s | 0.5 s |
| Input | 6.9 MB | 8.2 MB | 8.4 MB | 9.0 MB | 11.0 MB |
| Shuffle Read (Remote) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 676.6 KB |



h3. Aggregated Metrics by Executor

|| Executor ID || Address || Task Time || Total Tasks || Failed Tasks || Succeeded Tasks || Input || Output || Shuffle Read || Shuffle Write || Shuffle Spill (Memory) || Shuffle Spill (Disk) ||
| 0 | n-62-23-3:49566 | 5.7 h | 9 | 0 | 9 | 171.0 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 1 | n-62-23-6:57518 | 16.4 h | 20 | 0 | 20 | 169.9 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 2 | n-62-18-48:33551 | 0 ms | 0 | 0 | 0 | 169.6 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 3 | n-62-23-5:58421 | 2.9 min | 12 | 0 | 12 | 266.2 MB | 0.0 B | 4.5 MB | 0.0 B | 0.0 B | 0.0 B |
| 4 | n-62-23-1:40096 | 23 min | 164 | 0 | 164 | 1430.4 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |




h3. Tasks

|| Index || ID || Attempt || Status || Locality Level || Executor ID / Host || Launch Time || Duration || GC Time || Input || Shuffle Read || Errors ||
| 1 | 2 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 6.9 MB (memory) | 676.6 KB |    |
| 0 | 1 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.7 MB (network) | 0.0 B |    |
| 4 | 5 | 0 | SUCCESS | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 38 min | 0.4 s | 8.6 MB (network) | 0.0 B |    |
| 3 | 4 | 0 | RUNNING | ANY | 2 / n-62-18-48 | 2015/03/12 21:55:00 | 55 min |  | 8.3 MB (network) | 0.0 B |    |
| 2 | 3 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 11 s | 0.3 s | 8.4 MB (memory) | 0.0 B |    |
| 7 | 8 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 12 s | 0.3 s | 9.2 MB (memory) | 0.0 B |    |
| 6 | 7 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 8.1 MB (memory) | 0.0 B |    |
| 5 | 6 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.6 MB (network) | 0.0 B |    |
| 9 | 10 | 0 | RUNNING | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 55 min |  | 8.7 MB (network) | 0.0 B |    |









