zuston commented on code in PR #275:
URL: https://github.com/apache/incubator-uniffle/pull/275#discussion_r1001549008
##########
common/src/main/java/org/apache/uniffle/common/util/RssUtils.java:
##########
@@ -218,6 +220,13 @@ private static List<ShuffleDataSegment> transIndexDataToSegments(byte[] indexDat
       bufferSegments.add(new BufferSegment(blockId, bufferOffset, length, uncompressLength, crc, taskAttemptId));
       bufferOffset += length;
+      totalLength += length;
+
+      // If ShuffleServer is flushing the file at this time, the length in the index file record may be greater
Review Comment:
I think this problem only occurs after all map tasks have finished and the data stored in memory has been flushed to localfile/HDFS, and at that moment the Spark client reads the redundant index data. Right?
Analyzed from this perspective, if you drop this redundant index data, could it cause a data-loss problem because the corresponding data is still sitting in the HDFS client buffer rather than in memory or on HDFS? I think it won't. The in-flight data is only removed from memory after it has been flushed to HDFS, and the `dataWriter.close()` call in `HdfsShuffleWriteHandler` ensures the data is fully flushed to HDFS.
So this change is OK and won't cause data loss. But I have a question: why not call `dataWriter.flush` and `indexWriter.flush` after writing each block to solve this problem? Would that cause a performance regression?
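To make the point concrete, here is a minimal sketch (not the actual Uniffle code) of the trimming behavior the diff introduces: while walking the index records, stop as soon as the accumulated length would run past the known data file length, since the trailing records describe bytes that are still being flushed. The `IndexRecord` type and `dataFileLength` parameter are illustrative names, not Uniffle APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexTrimSketch {
  // Hypothetical stand-in for one entry of the shuffle index file.
  record IndexRecord(long offset, int length) {}

  // Keep only the records whose data is fully present in the data file.
  static List<IndexRecord> trimToDataLength(List<IndexRecord> records, long dataFileLength) {
    List<IndexRecord> kept = new ArrayList<>();
    long totalLength = 0;
    for (IndexRecord r : records) {
      // A record whose end would pass the data file length points at bytes
      // not yet flushed to the file; drop it and everything after it.
      if (totalLength + r.length() > dataFileLength) {
        break;
      }
      kept.add(r);
      totalLength += r.length();
    }
    return kept;
  }

  public static void main(String[] args) {
    List<IndexRecord> records = List.of(
        new IndexRecord(0, 100),
        new IndexRecord(100, 200),
        new IndexRecord(300, 150));
    // The data file only contains 300 bytes so far, so the last index
    // record is redundant and is skipped by the reader.
    List<IndexRecord> kept = trimToDataLength(records, 300);
    System.out.println(kept.size()); // 2
  }
}
```

Calling `flush` per block instead would make the index and data files consistent at every step, but each flush forces a round trip to the HDFS DataNode pipeline, which is why it could regress write throughput.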
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]