[
https://issues.apache.org/jira/browse/SPARK-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562650#comment-14562650
]
Apache Spark commented on SPARK-6307:
-------------------------------------
User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6454
> Executors fetch the same rdd-block 100's or 1000's of times
> -----------------------------------------------------------
>
> Key: SPARK-6307
> URL: https://issues.apache.org/jira/browse/SPARK-6307
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Affects Versions: 1.2.0
> Environment: Linux, Spark Standalone 1.2, running in a PBS grid engine
> Reporter: Tobias Bertelsen
>
> The block manager kept fetching the same blocks over and over, making tasks
> with network activity extremely slow. Two identical tasks can take anywhere
> from 12 seconds to more than an hour (at which point I stopped it).
> Spark should cache the blocks so that it does not fetch the same blocks
> repeatedly.
> Here is a simplified version of the code that provokes it:
> {code}
> // Read a few thousand lines (~ 15 MB)
> val fileContents = sc.newAPIHadoopFile(path, ......).repartition(16)
> val data = fileContents.map{x => parseContent(x)}.cache()
> // Do a pairwise comparison and count the best pairs
> val pairs = data.cartesian(data).filter { case (x, y) =>
> similarity(x, y) > 0.9
> }
> pairs.count()
> {code}
> This is a tiny fraction of one worker's stderr:
> {code}
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_1 remotely
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_0 remotely
> Thousands more lines, fetching the same 16 remote blocks
> 15/03/12 22:25:44 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> {code}
> h2. Details for that stage from the UI.
> - *Total task time across all tasks:* 11.9 h
> - *Input:* 2.2 GB
> - *Shuffle read:* 4.5 MB
> h3. Summary Metrics for 176 Completed Tasks
> || Metric || Min || 25th percentile || Median || 75th percentile || Max ||
> | Duration | 7 s | 8 s | 8 s | 12 s | 59 min |
> | GC Time | 0 ms | 99 ms | 0.1 s | 0.2 s | 0.5 s |
> | Input | 6.9 MB | 8.2 MB | 8.4 MB | 9.0 MB | 11.0 MB |
> | Shuffle Read (Remote) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 676.6 KB |
> h3. Aggregated Metrics by Executor
> || Executor ID || Address || Task Time || Total Tasks || Failed Tasks || Succeeded Tasks || Input || Output || Shuffle Read || Shuffle Write || Shuffle Spill (Memory) || Shuffle Spill (Disk) ||
> | 0 | n-62-23-3:49566 | 5.7 h | 9 | 0 | 9 | 171.0 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
> | 1 | n-62-23-6:57518 | 16.4 h | 20 | 0 | 20 | 169.9 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
> | 2 | n-62-18-48:33551 | 0 ms | 0 | 0 | 0 | 169.6 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
> | 3 | n-62-23-5:58421 | 2.9 min | 12 | 0 | 12 | 266.2 MB | 0.0 B | 4.5 MB | 0.0 B | 0.0 B | 0.0 B |
> | 4 | n-62-23-1:40096 | 23 min | 164 | 0 | 164 | 1430.4 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
> h3. Tasks
> || Index || ID || Attempt || Status || Locality Level || Executor ID / Host || Launch Time || Duration || GC Time || Input || Shuffle Read || Errors ||
> | 1 | 2 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 6.9 MB (memory) | 676.6 KB | |
> | 0 | 1 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.7 MB (network) | 0.0 B | |
> | 4 | 5 | 0 | SUCCESS | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 38 min | 0.4 s | 8.6 MB (network) | 0.0 B | |
> | 3 | 4 | 0 | RUNNING | ANY | 2 / n-62-18-48 | 2015/03/12 21:55:00 | 55 min | | 8.3 MB (network) | 0.0 B | |
> | 2 | 3 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 11 s | 0.3 s | 8.4 MB (memory) | 0.0 B | |
> | 7 | 8 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 12 s | 0.3 s | 9.2 MB (memory) | 0.0 B | |
> | 6 | 7 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 8.1 MB (memory) | 0.0 B | |
> | 5 | 6 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.6 MB (network) | 0.0 B | |
> | 9 | 10 | 0 | RUNNING | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 55 min | | 8.7 MB (network) | 0.0 B | |
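
The repeated "Found block ... remotely" lines in the log would be consistent with a nested iteration in which the inner partition is requested again for every element of the outer partition. The report does not confirm this, so the following is only a sketch of that assumed pattern in plain Scala, without Spark. Every name in it (CartesianFetchSketch, fetchRemoteBlock, naivePairs, bufferedPairs) is hypothetical and stands in for whatever the block manager actually does.

```scala
// Sketch only: plain Scala collections, no Spark dependency.
// All names here are hypothetical and chosen for illustration.
object CartesianFetchSketch {
  // Counts how often the simulated remote block is requested.
  var fetches = 0

  // Stand-in for reading a cached partition held by another executor.
  def fetchRemoteBlock(block: Seq[Int]): Iterator[Int] = {
    fetches += 1
    block.iterator
  }

  // Nested iteration: the inner iterator is re-created, and the block
  // re-requested, once for every element of the outer partition.
  def naivePairs(left: Seq[Int], right: Seq[Int]): Iterator[(Int, Int)] =
    for (x <- left.iterator; y <- fetchRemoteBlock(right)) yield (x, y)

  // Request the block once, keep a local copy, and reuse it.
  def bufferedPairs(left: Seq[Int], right: Seq[Int]): Iterator[(Int, Int)] = {
    val local = fetchRemoteBlock(right).toVector
    for (x <- left.iterator; y <- local.iterator) yield (x, y)
  }

  def main(args: Array[String]): Unit = {
    val left = 1 to 1000
    val right = 1 to 3

    fetches = 0
    val naiveCount = naivePairs(left, right).size
    println(s"naive: $naiveCount pairs, $fetches block requests")

    fetches = 0
    val bufferedCount = bufferedPairs(left, right).size
    println(s"buffered: $bufferedCount pairs, $fetches block requests")
  }
}
```

With 1000 outer elements, the naive variant requests the block 1000 times while the buffered variant requests it once, and both produce the same 3000 pairs. If the real code path behaves like the naive variant, a partition that is local to the executor makes those repeated requests cheap, whereas a remote one turns each request into a network fetch, which would fit the split between the "memory" and "network" tasks in the table above.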