[GitHub] [spark] weixiuli commented on a change in pull request #35492: [SPARK-38191][CORE] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

GitBox Tue, 22 Feb 2022 21:20:49 -0800


weixiuli commented on a change in pull request #35492:
URL: https://github.com/apache/spark/pull/35492#discussion_r812567016




##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
    * The staging directory of this write job. Spark uses it to deal with files with absolute output
    * path, or writing data into partitioned directory with dynamicPartitionOverwrite=true.
    */
-  protected def stagingDir = getStagingDir(path, jobId)
+  @transient protected lazy val stagingDir = getStagingDir(path, jobId)

Review comment:
   I have already checked that OutputCommitCoordinatorSuite fails when stagingDir is not lazy.
   
   For example:
   
   ```scala
   test("If commit fails, if task is retried it should not be locked, and will succeed.") {
     val rdd = sc.parallelize(Seq(1), 1)
     sc.runJob(rdd, OutputCommitFunctions(tempDir.getAbsolutePath).failFirstCommitAttempt _,
       0 until rdd.partitions.size)
     assert(tempDir.list().size === 1)
   }
   ```
   
   ```
   Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.0.78.226 executor driver): java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.spark.internal.io.FileCommitProtocol$.instantiate(FileCommitProtocol.scala:228)
        at org.apache.spark.scheduler.OutputCommitFunctions.runCommitWithProvidedCommitter(OutputCommitCoordinatorSuite.scala:316)
        at org.apache.spark.scheduler.OutputCommitFunctions.failFirstCommitAttempt(OutputCommitCoordinatorSuite.scala:304)
        at org.apache.spark.scheduler.OutputCommitCoordinatorSuite.$anonfun$new$8(OutputCommitCoordinatorSuite.scala:148)
        at org.apache.spark.scheduler.OutputCommitCoordinatorSuite.$anonfun$new$8$adapted(OutputCommitCoordinatorSuite.scala:148)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1475)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalArgumentException: Can not create a Path from a null string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
        at org.apache.hadoop.fs.Path.<init>(Path.java:184)
        at org.apache.hadoop.fs.Path.<init>(Path.java:119)
        at org.apache.spark.internal.io.FileCommitProtocol$.getStagingDir(FileCommitProtocol.scala:233)
        at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.<init>(HadoopMapReduceCommitProtocol.scala:107)
        at org.apache.spark.internal.io.HadoopMapRedCommitProtocol.<init>(HadoopMapRedCommitProtocol.scala:30)
        ... 18 more
   ```
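   
   The trace shows getStagingDir being called from the HadoopMapReduceCommitProtocol constructor (`<init>`) with a null path, which is what an eager `val` would do. A minimal sketch of the difference, assuming nothing about the real classes: `FakePath`, `EagerProtocol`, `LazyProtocol`, and this local `getStagingDir` are hypothetical stand-ins written for illustration, not Spark or Hadoop code.
   
   ```scala
   import scala.util.Try

   object StagingDirSketch {
     // Hypothetical stand-in for org.apache.hadoop.fs.Path (no Hadoop dependency needed).
     final case class FakePath(value: String)

     // Hypothetical stand-in for FileCommitProtocol.getStagingDir: rejects a null path,
     // mirroring the "Can not create a Path from a null string" error in the trace.
     def getStagingDir(path: String, jobId: String): FakePath = {
       if (path == null) {
         throw new IllegalArgumentException("Can not create a Path from a null string")
       }
       FakePath(s"$path/.spark-staging-$jobId")
     }

     // Eager val: evaluated inside the constructor, so a null path fails at instantiation.
     class EagerProtocol(path: String, jobId: String) extends Serializable {
       protected val stagingDir: FakePath = getStagingDir(path, jobId)
     }

     // Lazy val: evaluated on first access and cached once it succeeds, so construction
     // works even with a null path as long as stagingDir is never touched.
     class LazyProtocol(path: String, jobId: String) extends Serializable {
       @transient protected lazy val stagingDir: FakePath = getStagingDir(path, jobId)
       def stagingDirString: String = stagingDir.value
     }

     def main(args: Array[String]): Unit = {
       println(Try(new EagerProtocol(null, "job1")).isFailure)
       println(Try(new LazyProtocol(null, "job1")).isSuccess)
       println(new LazyProtocol("/tmp/out", "job1").stagingDirString)
     }
   }
   ```
   
   With the lazy variant, a committer built with a null path (as the test above appears to do) can still be instantiated, and the staging directory is computed only once for callers that actually use it.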
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



