[
https://issues.apache.org/jira/browse/SPARK-33022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
duanmeng updated SPARK-33022:
-----------------------------
Description:
A data file can be empty even after DiskBlockObjectWriter has committed it in
BypassMergeSortShuffleWriter. writePartitionedFile then returns wrong partition
lengths, which causes data loss. The root cause is related to the disk/kernel,
but we can guard against it in Spark without any performance loss by comparing
partitionWriterSegments[i].length with length[i] after Utils.copyStream.
I added some logs and caught the failure.
The log when this issue occurred:
{code:java}
20/09/28 00:42:44 INFO sort.BypassMergeSortShuffleWriter: partitionWriterSegments[0]: (name=temp_shuffle_38244ef5-8e97-4428-97b8-feffc16fc9f7, offset=0, length=1462)
20/09/28 00:42:46 INFO sort.BypassMergeSortShuffleWriter: File length: 0
20/09/28 00:42:46 INFO sort.BypassMergeSortShuffleWriter: Copied stream length: 0
{code}
The corresponding log from a run where the issue did not occur:
{code:java}
20/09/28 10:11:45 INFO sort.BypassMergeSortShuffleWriter: partitionWriterSegments[0]: (name=temp_shuffle_f6937469-39fd-4576-b40e-69f4276cc8e4, offset=0, length=1462)
20/09/28 10:11:45 INFO sort.BypassMergeSortShuffleWriter: File length: 1462
20/09/28 10:11:45 INFO sort.BypassMergeSortShuffleWriter: Copied stream length: 1462
{code}
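The length comparison proposed in the description could look roughly like the
standalone sketch below. It is not Spark code: SegmentLengthCheck and
copyAndVerify are hypothetical names, and java.nio.file.Files.copy stands in
for Utils.copyStream. The idea is only that the byte count returned by the copy
is compared with the length recorded when the segment was committed, and the
write fails loudly instead of reporting a wrong partition length.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Standalone sketch, not Spark code. All names here are hypothetical. */
final class SegmentLengthCheck {

    private SegmentLengthCheck() {}

    /**
     * Copies one partition's temp file into the merged output and verifies that
     * the number of bytes copied equals the length recorded at commit time.
     */
    static long copyAndVerify(Path segmentFile, long committedLength, OutputStream mergedOutput)
            throws IOException {
        // Files.copy returns the number of bytes actually read from the file.
        long copied = Files.copy(segmentFile, mergedOutput);
        if (copied != committedLength) {
            throw new IOException("Partition segment " + segmentFile
                + " has committed length " + committedLength
                + " but " + copied + " bytes were copied");
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path healthy = Files.createTempFile("temp_shuffle_", ".data");
        Path empty = Files.createTempFile("temp_shuffle_", ".data");
        try {
            // Healthy case: the file really holds the 1462 committed bytes.
            Files.write(healthy, new byte[1462]);
            long copied = copyAndVerify(healthy, 1462L, new ByteArrayOutputStream());
            System.out.println("Copied stream length: " + copied);

            // Failure case from the log: committed length 1462, file length 0.
            try {
                copyAndVerify(empty, 1462L, new ByteArrayOutputStream());
            } catch (IOException e) {
                System.out.println("Detected mismatch: " + e.getMessage());
            }
        } finally {
            Files.deleteIfExists(healthy);
            Files.deleteIfExists(empty);
        }
    }
}
```

Failing the task on a mismatch lets the scheduler retry it, which seems
preferable to silently publishing a zero-length partition. The extra cost is
one long comparison per partition.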
> partition length is wrong after merge partition segments in
> BypassMergeSortShuffleWriter
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-33022
> URL: https://issues.apache.org/jira/browse/SPARK-33022
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.6
> Reporter: duanmeng
> Priority: Major