Github user davies commented on the issue:

    https://github.com/apache/spark/pull/15923
  
    Manually tested this patch with a job that usually failed because of a 
corrupt stream. The log shows:
    ```
    16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_613_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
    16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_688_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
    16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_2434_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
    16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_878_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
    16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_1042_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
    16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_2301_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
    16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_2546_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
    16/11/20 08:32:27 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_3160_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
    16/11/20 08:32:27 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_3601_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
    ...
    16/11/20 08:32:41 INFO Executor: Finished task 275.0 in stage 26.0 (TID 22187). 5219 bytes result sent to driver
    ```
    The shuffle fetcher got some corrupt blocks for partition 275. It retried 
the fetch once, and the task then succeeded.
    
    But the retry cannot protect every task: some still failed with 
FetchFailed, and the stage was then retried:
    ```
    26            2016/11/20 08:31:24  1.0min  403/1000 (2 failed)  205.6 GB  29.5 GB  org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
    
    26 (retry 1)  2016/11/20 08:34:00  34s     200/629 (2 failed)   102.0 GB  14.6 GB  org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
    
    26 (retry 2)  2016/11/20 08:35:25  1.8min  461/461              235.1 GB  33.7 GB
    ```
    
    Stage 26 succeeded after being retried twice.
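
For readers unfamiliar with the behavior tested above, here is a minimal sketch of the retry-once idea: refetch a block that looks corrupt a single time, and report a fetch failure if the second copy is corrupt too. This is an illustration, not the code in the patch. `fetch_with_one_retry`, `FetchFailed`, and the `fetch` callable are hypothetical names, and detecting corruption by attempting a zlib decompression is an assumption made only to keep the example self-contained.

```python
import zlib


class FetchFailed(Exception):
    """Hypothetical stand-in for a fetch failure that triggers a stage retry."""


def fetch_with_one_retry(fetch, block_id):
    """Fetch a compressed block, refetching once if it cannot be decompressed.

    `fetch` is a hypothetical callable that returns the raw bytes of a block.
    Corruption detection via zlib is an assumption made for this sketch.
    """
    for attempt in range(2):  # the first fetch plus a single retry
        raw = fetch(block_id)
        try:
            return zlib.decompress(raw)
        except zlib.error:
            if attempt == 0:
                print(f"got a corrupted block {block_id}, fetch again")
    # Both copies were corrupt: give up and let the caller retry at a higher level.
    raise FetchFailed(f"Stream is corrupted: {block_id}")
```

In this sketch a single bad read is absorbed by the retry, while two bad reads in a row surface as a failure, which matches the two outcomes reported above.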
    
    Another thing is that all the corruption happened on only 2 of the 26 
nodes. A few broadcast blocks were also corrupt on them. It seems that the 
corruption happens on the receiving (fetcher) side of the network.
    
    I will update the patch to address comments.

