squito commented on a change in pull request #25007: [SPARK-28209][CORE][SHUFFLE] Proposed new shuffle writer API URL: https://github.com/apache/spark/pull/25007#discussion_r308403081
########## File path: core/src/main/java/org/apache/spark/shuffle/api/ShuffleWriteSupport.java ########## @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.shuffle.api; + +import java.io.IOException; + +import org.apache.spark.annotation.Private; +import org.apache.spark.shuffle.ShuffleWriteMetricsReporter; + +/** + * :: Private :: + * A module that returns shuffle writers to persist data that is written by shuffle map tasks. + * + * @since 3.0.0 + */ +@Private +public interface ShuffleWriteSupport { + + /** + * Called once per map task to create a writer that will be responsible for persisting all the + * partitioned bytes written by that map task. + * + * @param shuffleId Unique identifier for the shuffle stage of the map task + * @param mapId Within the shuffle stage, the identifier of the map task + * @param mapTaskAttemptId Identifier of the task attempt. Multiple attempts of the same map task + * with the same (shuffleId, mapId) pair can be distinguished by the + * different values of mapTaskAttemptId. + * @param numPartitions The number of partitions that will be written by the map task. Some of + * these partitions may be empty. + * @param mapTaskWriteMetrics The map task's write metrics, which can be updated by the returned + * writer. The updates that are posted to this reporter are listed in + * the Spark UI. Note that the caller will update the total write time + * at the end of the map task, so implementations should not call + * {@link ShuffleWriteMetricsReporter#incWriteTime(long)}. + */ + ShuffleMapOutputWriter createMapOutputWriter( + int shuffleId, + int mapId, + long mapTaskAttemptId, + int numPartitions, + ShuffleWriteMetricsReporter mapTaskWriteMetrics) throws IOException; Review comment: sorry for the delays from me. So after a closer look, I actually am pretty sure we should remove this from the api, and also any use of it from `LocalDiskShuffleMapOutputWriter`. That also means that test change I was originally commenting on, which sets the TaskContext, could also be removed. I think the current code in this patch is wrong, its double counting the write time for the final merged file. The original code did *not* create a TimeTrackingOutputStream for the merged file -- it just counted the time for the total creation of that file. https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L190-L213 and it seems like we'd have to do that, as we might be using a channel there, and then we wouldn't have an equivalent way of doing it for the channel. The current code in this pr does something similar here in `BypassMergeSortShuffleWriter.writePartitionedData()`: https://github.com/apache/spark/pull/25007/files#diff-8b6b7a5dadc0d8e97307d0f8e8378d8fR247 But its also passing that to the `LocalDiskShuffleMapOutputWriter` in a `TimeTrackingOutputStream`: https://github.com/apache/spark/pull/25007/files#diff-17636cf695d4c63ea3e15c3d71d63707R133 Sorry I should have looked at the use of those metrics more closely in the first place. But I think this means we can remove that metrics object from the api entirely. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
