[spark] branch master updated: [SPARK-38191][CORE][FOLLOWUP] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol

srowen Wed, 02 Mar 2022 16:02:08 -0800

This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 23db9b4  [SPARK-38191][CORE][FOLLOWUP] The staging directory of write 
job only needs to be initialized once in HadoopMapReduceCommitProtocol
23db9b4 is described below

commit 23db9b440ba70f4edf1f4a604f4829e1831ea502
Author: weixiuli <[email protected]>
AuthorDate: Wed Mar 2 18:00:56 2022 -0600

    [SPARK-38191][CORE][FOLLOWUP] The staging directory of write job only needs 
to be initialized once in HadoopMapReduceCommitProtocol
    
    ### What changes were proposed in this pull request?
    
    This pr follows up the https://github.com/apache/spark/pull/35492, try to 
use a stagingDir constant instead of the  stagingDir method in 
HadoopMapReduceCommitProtocol.
    
    ### Why are the changes needed?
    
    In the https://github.com/apache/spark/pull/35492#issuecomment-1054910730
    
    ```
    ./build/sbt -mem 4096 -Phadoop-2 "sql/testOnly 
org.apache.spark.sql.sources.PartitionedWriteSuite -- -z SPARK-27194"
    ...
    [info]   Cause: org.apache.spark.SparkException: Task not serializable
    ...
    [info]   Cause: java.io.NotSerializableException: org.apache.hadoop.fs.Path
    ...
    
    ```
    It's because org.apache.hadoop.fs.Path is serializable in Hadoop3 but not 
in Hadoop2.  So, we should make the stagingDir  transient to avoid that.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    ### How was this patch tested?
    
    Passed `./build/sbt -mem 4096 -Phadoop-2 "sql/testOnly 
org.apache.spark.sql.sources.PartitionedWriteSuite -- -z SPARK-27194"`
    
    Pass the CIs.
    
    Closes #35693 from weixiuli/staging-directory.
    
    Authored-by: weixiuli <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
---
 .../org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala    | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 
b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
index a39e9ab..3a24da9 100644
--- 
a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
+++ 
b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
    * The staging directory of this write job. Spark uses it to deal with files 
with absolute output
    * path, or writing data into partitioned directory with 
dynamicPartitionOverwrite=true.
    */
-  protected def stagingDir = getStagingDir(path, jobId)
+  @transient protected lazy val stagingDir = getStagingDir(path, jobId)
 
   protected def setupCommitter(context: TaskAttemptContext): OutputCommitter = 
{
     val format = context.getOutputFormatClass.getConstructor().newInstance()

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[spark] branch master updated: [SPARK-38191][CORE][FOLLOWUP] The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol

Reply via email to