duanmeng opened a new pull request #29907:
URL: https://github.com/apache/spark/pull/29907
The segment file may be empty even after the writer commits it, caused by
disk/kernel issues on a heavily loaded cluster. Compare the segment length with
the copied length to guard against this.
A data file might be empty even after DiskBlockObjectWriter commits it in
BypassMergeSortShuffleWriter, which then returns wrong lengths from
writePartitionedFile; those lengths feed into MapStatus and cause data loss.
This is a disk/kernel issue, but we can guard against it in Spark without any
performance loss: compare partitionWriterSegments[i].length with lengths[i]
after Utils.copyStream, as sketched below.
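
A minimal standalone sketch of the proposed guard, not the actual patch:
`copyAndVerify` and `committedLength` are illustrative names standing in for
the copy loop in writePartitionedFile, where `partitionWriterSegments[i].length`
would be compared against `lengths[i]`.
```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SegmentCopyGuard {

  // Copies one partition segment into the combined shuffle output and checks
  // that the bytes actually copied match the length recorded when the segment
  // was committed. A mismatch means the committed data never made it to disk.
  static long copyAndVerify(File segmentFile, long committedLength, OutputStream out)
      throws IOException {
    long copied = 0L;
    try (InputStream in = new FileInputStream(segmentFile)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
        copied += n;
      }
    }
    if (copied != committedLength) {
      // Fail fast rather than report a wrong length to MapStatus, which
      // would silently lose this partition's data downstream.
      throw new IOException("Copied " + copied + " bytes from " + segmentFile +
          " but the committed segment length was " + committedLength);
    }
    return copied;
  }
}
```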
I added some logs and reproduced the issue.
The log when the issue happened:
```
20/09/28 00:42:44 INFO sort.BypassMergeSortShuffleWriter: partitionWriterSegments[0]: (name=temp_shuffle_38244ef5-8e97-4428-97b8-feffc16fc9f7, offset=0, length=1462)
20/09/28 00:42:46 INFO sort.BypassMergeSortShuffleWriter: File length: 0
20/09/28 00:42:46 INFO sort.BypassMergeSortShuffleWriter: Copied stream length: 0
```
The corresponding log when the issue didn't happen:
```
20/09/28 10:11:45 INFO sort.BypassMergeSortShuffleWriter: partitionWriterSegments[0]: (name=temp_shuffle_f6937469-39fd-4576-b40e-69f4276cc8e4, offset=0, length=1462)
20/09/28 10:11:45 INFO sort.BypassMergeSortShuffleWriter: File length: 1462
20/09/28 10:11:45 INFO sort.BypassMergeSortShuffleWriter: Copied stream length: 1462
```
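
For reference, a hedged paraphrase of the kind of instrumentation that would
emit the three lines above; the actual debugging patch may differ, and
`segment`, `file`, and `copied` stand for the per-partition values inside
writePartitionedFile's copy loop.
```java
import java.io.File;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class CopyDiagnostics {
  private static final Logger logger = LoggerFactory.getLogger(CopyDiagnostics.class);

  // Logs the segment metadata recorded at commit time, the current on-disk
  // file length, and the byte count returned by the copy, matching the three
  // quoted log lines. When the disk/kernel drops the write, the last two
  // values show 0 even though the committed segment length is non-zero.
  static void logCopyResult(int partitionId, Object segment, File file, long copied) {
    logger.info("partitionWriterSegments[{}]: {}", partitionId, segment);
    logger.info("File length: {}", file.length());
    logger.info("Copied stream length: {}", copied);
  }
}
```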