GitHub user davies commented on the issue:
https://github.com/apache/spark/pull/15923
Manually tested this patch with a job that usually failed because of a
corrupt stream. The log showed:
```
16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_613_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_688_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
16/11/20 08:32:07 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_2434_275 from BlockManagerId(6, 10.1.109.163, 34744), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_878_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_1042_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_2301_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:26 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_2546_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:27 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_3160_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
16/11/20 08:32:27 WARN ShuffleBlockFetcherIterator: got an corrupted block
shuffle_5_3601_275 from BlockManagerId(21, 10.1.107.237, 35876), fetch again
...
16/11/20 08:32:41 INFO Executor: Finished task 275.0 in stage 26.0 (TID
22187). 5219 bytes result sent to driver
```
The shuffle fetcher received some corrupt blocks for partition 275, retried
once, and the task then succeeded.
But the retry cannot protect every task: some still failed with FetchFailed,
and the stage was then retried:
```
26            2016/11/20 08:31:24  1.0min  403/1000 (2 failed)  205.6 GB  29.5 GB  org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
26 (retry 1)  2016/11/20 08:34:00  34s     200/629 (2 failed)   102.0 GB  14.6 GB  org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
26 (retry 2)  2016/11/20 08:35:25  1.8min  461/461              235.1 GB  33.7 GB
```
Stage 26 succeeded after being retried twice.
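To make the behavior above concrete, here is a minimal sketch of the "retry once, then fail the fetch" logic. It is not the code from this patch: `fetch_with_one_retry`, `fetch`, `is_corrupt`, and the `FetchFailed` class are hypothetical names used only for illustration.

```python
class FetchFailed(Exception):
    """Stands in for Spark's FetchFailedException in this sketch."""


def fetch_with_one_retry(block_id, fetch, is_corrupt):
    """Fetch a block; if it is corrupt, fetch it once more, then give up.

    `fetch` and `is_corrupt` are hypothetical callables supplied by the caller.
    """
    data = fetch(block_id)
    if not is_corrupt(data):
        return data

    # First corruption: warn and fetch the same block again.
    print(f"got a corrupted block {block_id}, fetch again")
    data = fetch(block_id)
    if is_corrupt(data):
        # Second corruption: report a fetch failure so the stage can be retried.
        raise FetchFailed(f"Stream is corrupted: {block_id}")
    return data
```

A single retry covers a one-off corruption, but a block that arrives corrupt twice in a row still surfaces as a fetch failure, which matches the stage retries seen above.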
Also, all of the corruption happened on only 2 of the 26 nodes, and a few
broadcast blocks were corrupt on those nodes as well. It seems that the
corruption happens on the receive (fetcher) side of the network.
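For anyone who wants to check the distribution on their own cluster, here is a rough sketch that counts the corrupted-block warnings per address. It assumes the WARN message format shown in the log above; `corrupt_blocks_per_host` is a hypothetical helper, not part of Spark. Note that the `BlockManagerId` in the warning appears to be the address the block was fetched from, so the fetching side is identified by which executor's log contains the warning.

```python
import re
from collections import Counter

# Matches the WARN message shown above, even when the line is wrapped.
PATTERN = re.compile(
    r"corrupted block\s+(\S+)\s+from\s+"
    r"BlockManagerId\((\d+),\s*([^,\s]+),\s*(\d+)\)"
)


def corrupt_blocks_per_host(log_text):
    """Count corrupted-block warnings per host named in BlockManagerId."""
    return Counter(match.group(3) for match in PATTERN.finditer(log_text))
```

Running this over each executor's log separately shows both which executors reported corruption and which remote addresses they were fetching from.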
I will update the patch to address comments.