[GitHub] [spark] zsxwing commented on a change in pull request #33002: [SPARK-35843][SQL] Unify the file name between batch and streaming file writers

GitBox Mon, 21 Jun 2021 11:52:51 -0700


zsxwing commented on a change in pull request #33002:
URL: https://github.com/apache/spark/pull/33002#discussion_r655627123




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala
##########
@@ -113,12 +113,15 @@ class ManifestFileCommitProtocol(jobId: String, path: String)
 
   override def newTaskTempFile(
       taskContext: TaskAttemptContext, dir: Option[String], ext: String): String = {
-    // The file name looks like part-r-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
-    // Note that %05d does not truncate the split number, so if we have more than 100000 tasks,
+    // Use the Spark task attempt ID which is unique within the write job, so that file writes never
+    // collide if the file name also includes job ID. The Hadoop task id is equivalent to Spark's
+    // partitionId, which is not unique within the write job, for cases like task retry or
+    // speculative tasks.
+    val taskId = TaskContext.get.taskAttemptId()
+    // The file name looks like part-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
+    // Note that %05d does not truncate the taskId, so if we have more than 100000 tasks,
     // the file name is fine and won't overflow.
-    val split = taskContext.getTaskAttemptID.getTaskID.getId
-    val uuid = UUID.randomUUID.toString
-    val filename = f"part-$split%05d-$uuid$ext"

Review comment:
       > 2\. The file output committer for streaming does not use staging 
directories. It writes files to the final path directly and uses a manifest 
file to track the committed files. Thus, partition ID is not sufficient to 
avoid file name collision. That's why we add a fresh UUID to the file name.
   
   Could you explain this? Currently `ManifestFileCommitProtocol` should always 
pick up a new uuid for each file.
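
   For reference, a minimal standalone sketch of the naming scheme in the removed lines of the hunk above. The `FileNameSketch` object and its `newFileName` helper are hypothetical names used only for illustration; they are not part of `ManifestFileCommitProtocol`. The sketch assumes nothing beyond the format string shown in the diff, and it shows two things: a fresh UUID per call keeps names distinct even when the same split number is reused (for example on a task retry), and `%05d` pads but does not truncate.

```scala
import java.util.UUID

// Hypothetical helper for illustration only; not Spark code.
object FileNameSketch {
  // Mirrors the removed line: f"part-$split%05d-$uuid$ext",
  // drawing a new random UUID on every call.
  def newFileName(split: Int, ext: String): String = {
    val uuid = UUID.randomUUID.toString
    f"part-$split%05d-$uuid$ext"
  }

  def main(args: Array[String]): Unit = {
    // Same split number twice, as with a retried or speculative task.
    val first = newFileName(3, ".gz.parquet")
    val retry = newFileName(3, ".gz.parquet")
    println(first)
    println(retry)
    // The names differ because each call draws its own UUID.
    println(first != retry)

    // %05d is a minimum width, so larger numbers are kept in full.
    println(newFileName(1234567, ".parquet").startsWith("part-1234567-"))
  }
}
```

   With this scheme, uniqueness comes from the per-file UUID rather than from the numeric prefix.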



