Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/20525#discussion_r166679420
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -190,9 +190,13 @@ object FileFormatWriter extends Logging {
global = false,
child = plan).execute()
}
- val ret = new Array[WriteTaskResult](rdd.partitions.length)
+
+ // SPARK-23271 If we are attempting to write a zero partition rdd, change the number of
+ // partition to 1 to make sure we at least set up one write task to write the metadata.
+ val finalRdd = if (rdd.partitions.length == 0) rdd.repartition(1) else rdd
--- End diff ---
@cloud-fan @pashazm I was thinking, writing empty datasets would not be a regular event, right? Should we even be optimizing this path? Secondly, is shuffling an empty dataset that expensive?
@cloud-fan, actually I had tried to launch a write task for an empty RDD, but was hitting a NullPointerException from the scheduler. It looks like things are set up to only work off of the partitions of an RDD. Could we try to create this empty metadata file from the driver in this case? If we go that route, then we may have to refactor the write task code. That seems like a lot for this little corner case, what do you think?
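
For context, here is a minimal local sketch of the zero-partition case the diff guards against. It is not part of the PR: the `local[2]` session, the use of `spark.emptyDataFrame`, and the printed checks are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: an empty dataset can produce an RDD with zero partitions,
// in which case no write task is ever launched and no metadata is written.
object EmptyWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("empty-write-sketch")
      .getOrCreate()

    // An empty DataFrame typically maps to an RDD with zero partitions.
    val rdd = spark.emptyDataFrame.rdd
    println(s"partitions before guard: ${rdd.partitions.length}")

    // The guard from the diff: bump a zero-partition RDD to a single partition
    // so at least one write task runs and can emit the metadata files.
    val finalRdd = if (rdd.partitions.length == 0) rdd.repartition(1) else rdd
    println(s"partitions after guard: ${finalRdd.partitions.length}")

    spark.stop()
  }
}
```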
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]