Github user davies commented on the issue:

https://github.com/apache/spark/pull/15923

I manually tested this patch with a job that usually failed because of a corrupt stream. The log shows:

```
16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_613_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_688_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_2434_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_878_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_1042_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_2301_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_2546_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:27 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_3160_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:27 WARN ShuffleBlockFetcherIterator: got an corrupted block shuffle_5_3601_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
...
16/11/20 08:32:41 INFO Executor: Finished task 275.0 in stage 26.0 (TID 22187). 5219 bytes result sent to driver
```

The shuffle fetcher received some corrupt blocks for partition 275 and fetched each of them once more, after which the task succeeded.
However, the retry cannot protect every task: some still failed with FetchFailed, and the stage was then retried:

```
26            2016/11/20 08:31:24  1.0min  403/1000 (2 failed)  205.6 GB  29.5 GB  org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
26 (retry 1)  2016/11/20 08:34:00  34s     200/629 (2 failed)   102.0 GB  14.6 GB  org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
26 (retry 2)  2016/11/20 08:35:25  1.8min  461/461              235.1 GB  33.7 GB
```

Stage 26 succeeded after being retried twice.

Another observation is that all the corruption happened on only 2 of the 26 nodes, and a few broadcast blocks were corrupt on those nodes as well. It seems that the corruption happens on the receiving (fetcher) side of the network.

I will update the patch to address the comments.
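
For readers unfamiliar with the behaviour described in this comment, here is a minimal sketch of "fetch again once, then fail the fetch". It is not the code from the patch. It is written in Python for brevity, uses zlib decompression as an assumed way of detecting a corrupt block, and the names `fetch_with_one_retry`, `fetch`, and `FetchFailed` are hypothetical stand-ins.

```python
import logging
import zlib


class FetchFailed(Exception):
    """Hypothetical stand-in for a fetch failure that triggers a stage retry."""


def fetch_with_one_retry(fetch, block_id, max_retries=1):
    """Return the decompressed block, re-fetching if it looks corrupt.

    `fetch` is a hypothetical callable mapping a block id to compressed bytes.
    Corruption is detected here by a failed zlib decompression (an assumption
    made for this sketch only).
    """
    for attempt in range(max_retries + 1):
        data = fetch(block_id)
        try:
            return zlib.decompress(data)
        except zlib.error:
            if attempt < max_retries:
                logging.warning("got a corrupted block %s, fetch again", block_id)
    # Every attempt returned corrupt data: give up and report a fetch failure.
    raise FetchFailed(f"Stream is corrupted: {block_id}")


if __name__ == "__main__":
    good = zlib.compress(b"shuffle data")
    bad = bytes(b ^ 0xFF for b in good)  # flip every bit to simulate corruption

    # First response is corrupt, second is intact: the retry recovers the block.
    responses = [bad, good]
    print(fetch_with_one_retry(lambda _id: responses.pop(0), "shuffle_5_613_275"))

    # Every response is corrupt: the retry is exhausted and the fetch fails.
    try:
        fetch_with_one_retry(lambda _id: bad, "shuffle_5_613_275")
    except FetchFailed as exc:
        print(exc)
```

A single retry only helps when the second transfer arrives intact. If the block is corrupted again, the fetch fails and recovery falls back to a stage retry, which matches the two stage retries reported above.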