zuston commented on code in PR #275:
URL: https://github.com/apache/incubator-uniffle/pull/275#discussion_r1001549008
##########
common/src/main/java/org/apache/uniffle/common/util/RssUtils.java:
##########
@@ -218,6 +220,13 @@ private static List<ShuffleDataSegment> transIndexDataToSegments(byte[] indexDat
       bufferSegments.add(new BufferSegment(blockId, bufferOffset, length, uncompressLength, crc, taskAttemptId));
       bufferOffset += length;
+      totalLength += length;
+
+      // If ShuffleServer is flushing the file at this time, the length in the index file record may be greater
Review Comment:
I think this problem only occurs after all map tasks have finished and the data stored in memory has been flushed to localfile/HDFS, and at that moment the Spark client reads the redundant index data. Right?
Analyzed from this perspective, if you drop this redundant index data, could it cause a data-loss problem because the corresponding data is still sitting in the HDFS client buffer rather than in memory or on HDFS? I think it won't. The in-flight data is only removed from memory after it has been flushed to HDFS, and the `dataWriter.close()` call in `HdfsShuffleWriteHandler` ensures the data is fully flushed to HDFS.
So this change is OK and won't cause data loss. But I have a question: why not call `dataWriter.flush` and `indexWriter.flush` after writing each block to solve this problem? Would that cause a performance regression?
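To make the point concrete, here is a minimal sketch (not the actual Uniffle code) of the trimming behavior the diff introduces: while walking the index records, stop as soon as the accumulated length would run past the known data file length, since the trailing records describe bytes that are still being flushed. The `IndexRecord` type and `dataFileLength` parameter are illustrative names, not Uniffle APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexTrimSketch {
  // Hypothetical stand-in for one entry of the shuffle index file.
  record IndexRecord(long offset, int length) {}

  // Keep only the records whose data is fully present in the data file.
  static List<IndexRecord> trimToDataLength(List<IndexRecord> records, long dataFileLength) {
    List<IndexRecord> kept = new ArrayList<>();
    long totalLength = 0;
    for (IndexRecord r : records) {
      // A record whose end would pass the data file length points at bytes
      // not yet flushed to the file; drop it and everything after it.
      if (totalLength + r.length() > dataFileLength) {
        break;
      }
      kept.add(r);
      totalLength += r.length();
    }
    return kept;
  }

  public static void main(String[] args) {
    List<IndexRecord> records = List.of(
        new IndexRecord(0, 100),
        new IndexRecord(100, 200),
        new IndexRecord(300, 150));
    // The data file only contains 300 bytes so far, so the last index
    // record is redundant and is skipped by the reader.
    List<IndexRecord> kept = trimToDataLength(records, 300);
    System.out.println(kept.size()); // 2
  }
}
```

Calling `flush` per block instead would make the index and data files consistent at every step, but each flush forces a round trip to the HDFS DataNode pipeline, which is why it could regress write throughput.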
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]