Github user juanrh commented on the issue:
https://github.com/apache/spark/pull/19583
Even though we only wait 5 seconds by default between retries, the retries
themselves can take a lot of time. For example in a simple word count job where
a node is lost during stage 1.0 I have seen the following logs for
`RetryingBlockFetcher`:
```
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 332
outstanding blocks after 5000 ms
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 286
outstanding blocks after 5000 ms
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 312
outstanding blocks after 5000 ms
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 264
outstanding blocks after 5000 ms
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 294
outstanding blocks after 5000 ms
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 316
outstanding blocks after 5000 ms
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 276
outstanding blocks after 5000 ms
17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 327
outstanding blocks after 5000 ms
17/09/13 23:36:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 316
outstanding blocks after 5000 ms
17/09/13 23:38:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 264
outstanding blocks after 5000 ms
17/09/13 23:40:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 294
outstanding blocks after 5000 ms
17/09/13 23:42:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 286
outstanding blocks after 5000 ms
17/09/13 23:44:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 312
outstanding blocks after 5000 ms
17/09/13 23:46:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 332
outstanding blocks after 5000 ms
17/09/13 23:48:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 327
outstanding blocks after 5000 ms
17/09/13 23:50:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 276
outstanding blocks after 5000 ms
17/09/13 23:50:31 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 327
outstanding blocks after 5000 ms
17/09/13 23:50:34 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 332
outstanding blocks after 5000 ms
17/09/13 23:50:37 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 312
outstanding blocks after 5000 ms
17/09/13 23:50:40 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 286
outstanding blocks after 5000 ms
17/09/13 23:50:43 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 294
outstanding blocks after 5000 ms
17/09/13 23:50:46 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 264
outstanding blocks after 5000 ms
17/09/13 23:50:49 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 316
outstanding blocks after 5000 ms
17/09/13 23:51:07 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 276
outstanding blocks after 5000 ms
```
where the job has the following stage submission and completion events
```
17/09/13 23:21:57 INFO DAGScheduler: Submitting ShuffleMapStage 0
(MapPartitionsRDD[2] at flatMap at WordCount.scala:64), which has no missing
parents
17/09/13 23:31:34 INFO DAGScheduler: Submitting ResultStage 1
(MapPartitionsRDD[4] at saveAsTextFile at WordCount.scala:110), which has no
missing parents
17/09/13 23:50:51 INFO DAGScheduler: Submitting ShuffleMapStage 0
(MapPartitionsRDD[2] at flatMap at WordCount.scala:64), which has no missing
parents
17/09/13 23:52:19 INFO DAGScheduler: Submitting ResultStage 1
(MapPartitionsRDD[4] at saveAsTextFile at WordCount.scala:110), which has no
missing parents
17/09/13 23:31:34 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at
WordCount.scala:64) finished in 575.500 s
17/09/13 23:52:19 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at
WordCount.scala:64) finished in 87.455 s
17/09/13 23:59:07 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at
WordCount.scala:110) finished in 408.167 s
```
This means that approximately 16 minutes of a 38 minutes job where spent in
retries, which is 42% of the job time. This might impact applications that lose
a node at a moment when a single action is running. For example applications
with a single action at the end, for writing the application results to
external storage.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]