This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 23db9b4 [SPARK-38191][CORE][FOLLOWUP] The staging directory of write
job only needs to be initialized once in HadoopMapReduceCommitProtocol
23db9b4 is described below
commit 23db9b440ba70f4edf1f4a604f4829e1831ea502
Author: weixiuli <[email protected]>
AuthorDate: Wed Mar 2 18:00:56 2022 -0600
[SPARK-38191][CORE][FOLLOWUP] The staging directory of write job only needs
to be initialized once in HadoopMapReduceCommitProtocol
### What changes were proposed in this pull request?
This pr follows up the https://github.com/apache/spark/pull/35492, try to
use a stagingDir constant instead of the stagingDir method in
HadoopMapReduceCommitProtocol.
### Why are the changes needed?
In the https://github.com/apache/spark/pull/35492#issuecomment-1054910730
```
./build/sbt -mem 4096 -Phadoop-2 "sql/testOnly
org.apache.spark.sql.sources.PartitionedWriteSuite -- -z SPARK-27194"
...
[info] Cause: org.apache.spark.SparkException: Task not serializable
...
[info] Cause: java.io.NotSerializableException: org.apache.hadoop.fs.Path
...
```
It's because org.apache.hadoop.fs.Path is serializable in Hadoop3 but not
in Hadoop2. So, we should make the stagingDir transient to avoid that.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passed `./build/sbt -mem 4096 -Phadoop-2 "sql/testOnly
org.apache.spark.sql.sources.PartitionedWriteSuite -- -z SPARK-27194"`
Pass the CIs.
Closes #35693 from weixiuli/staging-directory.
Authored-by: weixiuli <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
---
.../org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git
a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
index a39e9ab..3a24da9 100644
---
a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
+++
b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
@@ -104,7 +104,7 @@ class HadoopMapReduceCommitProtocol(
* The staging directory of this write job. Spark uses it to deal with files
with absolute output
* path, or writing data into partitioned directory with
dynamicPartitionOverwrite=true.
*/
- protected def stagingDir = getStagingDir(path, jobId)
+ @transient protected lazy val stagingDir = getStagingDir(path, jobId)
protected def setupCommitter(context: TaskAttemptContext): OutputCommitter =
{
val format = context.getOutputFormatClass.getConstructor().newInstance()
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]