Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/20525#discussion_r166679420
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -190,9 +190,13 @@ object FileFormatWriter extends Logging {
global = false,
child = plan).execute()
}
- val ret = new Array[WriteTaskResult](rdd.partitions.length)
+
+ // SPARK-23271 If we are attempting to write a zero partition rdd, change the number of
+ // partition to 1 to make sure we at least set up one write task to write the metadata.
+ val finalRdd = if (rdd.partitions.length == 0) rdd.repartition(1) else rdd
--- End diff ---
@cloud-fan @pashazm I was thinking, writing empty datasets would not be a regular event, right? Should we even be optimizing this path? Secondly, is shuffling an empty dataset that expensive?
@cloud-fan, actually I had tried to launch a write task for an empty RDD, but was hitting a NullPointerException from the scheduler. It looks like things are set up to only work off of the partitions of an RDD. Could we try to create this empty metadata file from the driver in this case? If we go that route, then we may have to refactor the write task code. That seems like a lot for this little corner case, what do you think?
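
For context, here is a minimal local sketch of the zero-partition case the diff guards against. It is not part of the PR: the `local[2]` session, the use of `spark.emptyDataFrame`, and the printed checks are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: an empty dataset can produce an RDD with zero partitions,
// in which case no write task is ever launched and no metadata is written.
object EmptyWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("empty-write-sketch")
      .getOrCreate()

    // An empty DataFrame typically maps to an RDD with zero partitions.
    val rdd = spark.emptyDataFrame.rdd
    println(s"partitions before guard: ${rdd.partitions.length}")

    // The guard from the diff: bump a zero-partition RDD to a single partition
    // so at least one write task runs and can emit the metadata files.
    val finalRdd = if (rdd.partitions.length == 0) rdd.repartition(1) else rdd
    println(s"partitions after guard: ${finalRdd.partitions.length}")

    spark.stop()
  }
}
```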
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]