GitHub user juanrh commented on the issue:

    https://github.com/apache/spark/pull/19583
  
    Even though we only wait 5 seconds by default between retries, the retries 
themselves can take a lot of time. For example, in a simple word count job where 
a node is lost during stage 1.0, I have seen the following logs for 
`RetryingBlockFetcher`: 
    
    ```
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 332 
outstanding blocks after 5000 ms
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 286 
outstanding blocks after 5000 ms
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 312 
outstanding blocks after 5000 ms
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 264 
outstanding blocks after 5000 ms
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 294 
outstanding blocks after 5000 ms
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 316 
outstanding blocks after 5000 ms
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 276 
outstanding blocks after 5000 ms
    17/09/13 23:34:23 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 327 
outstanding blocks after 5000 ms
    17/09/13 23:36:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 316 
outstanding blocks after 5000 ms
    17/09/13 23:38:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 264 
outstanding blocks after 5000 ms
    17/09/13 23:40:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 294 
outstanding blocks after 5000 ms
    17/09/13 23:42:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 286 
outstanding blocks after 5000 ms
    17/09/13 23:44:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 312 
outstanding blocks after 5000 ms
    17/09/13 23:46:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 332 
outstanding blocks after 5000 ms
    17/09/13 23:48:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 327 
outstanding blocks after 5000 ms
    17/09/13 23:50:28 INFO RetryingBlockFetcher: Retrying fetch (2/3) for 276 
outstanding blocks after 5000 ms
    17/09/13 23:50:31 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 327 
outstanding blocks after 5000 ms
    17/09/13 23:50:34 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 332 
outstanding blocks after 5000 ms
    17/09/13 23:50:37 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 312 
outstanding blocks after 5000 ms
    17/09/13 23:50:40 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 286 
outstanding blocks after 5000 ms
    17/09/13 23:50:43 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 294 
outstanding blocks after 5000 ms
    17/09/13 23:50:46 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 264 
outstanding blocks after 5000 ms
    17/09/13 23:50:49 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 316 
outstanding blocks after 5000 ms
    17/09/13 23:51:07 INFO RetryingBlockFetcher: Retrying fetch (3/3) for 276 
outstanding blocks after 5000 ms
    ```
    
    where the job has the following stage submission and completion events 
    
    ```
    17/09/13 23:21:57 INFO DAGScheduler: Submitting ShuffleMapStage 0 
(MapPartitionsRDD[2] at flatMap at WordCount.scala:64), which has no missing 
parents
    17/09/13 23:31:34 INFO DAGScheduler: Submitting ResultStage 1 
(MapPartitionsRDD[4] at saveAsTextFile at WordCount.scala:110), which has no 
missing parents
    17/09/13 23:50:51 INFO DAGScheduler: Submitting ShuffleMapStage 0 
(MapPartitionsRDD[2] at flatMap at WordCount.scala:64), which has no missing 
parents
    17/09/13 23:52:19 INFO DAGScheduler: Submitting ResultStage 1 
(MapPartitionsRDD[4] at saveAsTextFile at WordCount.scala:110), which has no 
missing parents
    
    17/09/13 23:31:34 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at 
WordCount.scala:64) finished in 575.500 s
    17/09/13 23:52:19 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at 
WordCount.scala:64) finished in 87.455 s
    17/09/13 23:59:07 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at 
WordCount.scala:110) finished in 408.167 s
    ```
    
    This means that approximately 16 minutes of a 38-minute job were spent in 
retries, which is 42% of the job time. This could significantly impact 
applications that lose a node while a single action is running, for example 
applications whose final action writes the results to external storage.
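
    The retry behaviour seen in the logs above is governed by Spark's standard 
network retry settings: `spark.shuffle.io.maxRetries` (default 3, matching the 
"(1/3)" counters) and `spark.shuffle.io.retryWait` (default 5s, matching the 
"5000 ms" waits). As a minimal sketch, one could tune the retry window at 
submit time like this (the values, the `WordCount` class name, and the jar name 
are illustrative only, not recommendations):

    ```
    # Sketch: adjusting the shuffle-fetch retry settings that produce the
    # RetryingBlockFetcher behaviour above. Defaults are maxRetries=3 and
    # retryWait=5s; the values below are purely illustrative.
    spark-submit \
      --conf spark.shuffle.io.maxRetries=2 \
      --conf spark.shuffle.io.retryWait=3s \
      --class WordCount \
      wordcount.jar
    ```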

