[GitHub] [spark] squito commented on a change in pull request #25304: [SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API.

GitBox Tue, 06 Aug 2019 08:59:20 -0700

squito commented on a change in pull request #25304: [SPARK-28570][CORE][SHUFFLE] Make UnsafeShuffleWriter use the new API.
URL: https://github.com/apache/spark/pull/25304#discussion_r311131698


 ##########
 File path: core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java
 ##########
 @@ -273,57 +259,93 @@ void forceSorterToSpill() throws IOException {
    *
    * @return the partition lengths in the merged file.
    */
-  private long[] mergeSpills(SpillInfo[] spills,
-      ShuffleMapOutputWriter mapWriter) throws IOException {
+  private long[] mergeSpills(SpillInfo[] spills) throws IOException {
+    long[] partitionLengths;
+    if (spills.length == 0) {
+      final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
+          .createMapOutputWriter(
+              shuffleId,
+              mapId,
+              taskContext.taskAttemptId(),
+              partitioner.numPartitions());
+      mapWriter.commitAllPartitions();
+      return new long[partitioner.numPartitions()];
+    } else if (spills.length == 1) {
+      Optional<SingleFileShuffleMapOutputWriter> maybeSingleFileWriter =
+          shuffleExecutorComponents.createSingleFileMapOutputWriter(
+              shuffleId, mapId, taskContext.taskAttemptId());
+      if (maybeSingleFileWriter.isPresent()) {
+        // Here, we don't need to perform any metrics updates because the bytes written to this
+        // output file would have already been counted as shuffle bytes written.
+        partitionLengths = spills[0].partitionLengths;
+        maybeSingleFileWriter.get().transferMapOutputFile(spills[0].file, partitionLengths);
+      } else {
+        partitionLengths = mergeSpillsUsingStandardWriter(spills);
+      }
+    } else {
+      partitionLengths = mergeSpillsUsingStandardWriter(spills);
+    }
+    return partitionLengths;
+  }
+
+  private long[] mergeSpillsUsingStandardWriter(SpillInfo[] spills) throws IOException {
+    long[] partitionLengths;
     final boolean compressionEnabled = (boolean) sparkConf.get(package$.MODULE$.SHUFFLE_COMPRESS());
     final CompressionCodec compressionCodec = CompressionCodec$.MODULE$.createCodec(sparkConf);
     final boolean fastMergeEnabled =
-      (boolean) sparkConf.get(package$.MODULE$.SHUFFLE_UNDAFE_FAST_MERGE_ENABLE());
+        (boolean) sparkConf.get(package$.MODULE$.SHUFFLE_UNDAFE_FAST_MERGE_ENABLE());
     final boolean fastMergeIsSupported = !compressionEnabled ||
-      CompressionCodec$.MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
+        CompressionCodec$.MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
     final boolean encryptionEnabled = blockManager.serializerManager().encryptionEnabled();
-    final int numPartitions = partitioner.numPartitions();
-    long[] partitionLengths = new long[numPartitions];
+    final ShuffleMapOutputWriter mapWriter = shuffleExecutorComponents
+        .createMapOutputWriter(
+            shuffleId,
+            mapId,
+            taskContext.taskAttemptId(),
+            partitioner.numPartitions());
     try {
-      if (spills.length == 0) {
-        return partitionLengths;
-      } else {
-        // There are multiple spills to merge, so none of these spill files' lengths were counted
-        // towards our shuffle write count or shuffle write time. If we use the slow merge path,
-        // then the final output file's size won't necessarily be equal to the sum of the spill
-        // files' sizes. To guard against this case, we look at the output file's actual size when
-        // computing shuffle bytes written.
-        //
-        // We allow the individual merge methods to report their own IO times since different merge
-        // strategies use different IO techniques.  We count IO during merge towards the shuffle
-        // shuffle write time, which appears to be consistent with the "not bypassing merge-sort"
-        // branch in ExternalSorter.
-        if (fastMergeEnabled && fastMergeIsSupported) {
-          // Compression is disabled or we are using an IO compression codec that supports
-          // decompression of concatenated compressed streams, so we can perform a fast spill merge
-          // that doesn't need to interpret the spilled bytes.
-          if (transferToEnabled && !encryptionEnabled) {
-            logger.debug("Using transferTo-based fast merge");
-            partitionLengths = mergeSpillsWithTransferTo(spills, mapWriter);
-          } else {
-            logger.debug("Using fileStream-based fast merge");
-            partitionLengths = mergeSpillsWithFileStream(spills, mapWriter, null);
-          }
+      // There are multiple spills to merge, so none of these spill files' lengths were counted
+      // towards our shuffle write count or shuffle write time. If we use the slow merge path,
+      // then the final output file's size won't necessarily be equal to the sum of the spill
+      // files' sizes. To guard against this case, we look at the output file's actual size when
+      // computing shuffle bytes written.
+      //
+      // We allow the individual merge methods to report their own IO times since different merge
+      // strategies use different IO techniques.  We count IO during merge towards the shuffle
+      // shuffle write time, which appears to be consistent with the "not bypassing merge-sort"
 
 Review comment:
   typo: "shuffle shuffle" (the duplicated word was there before this change, but we might as well fix it now)

