GitHub user dilipbiswal opened a pull request:
https://github.com/apache/spark/pull/20525
SPARK-23271 Parquet output contains only _SUCCESS file after writing an
empty dataframe
## What changes were proposed in this pull request?
Below are the two cases.
``` scala
// Case 1
scala> List.empty[String].toDF().rdd.partitions.length
res18: Int = 1
```
When we write the above DataFrame as Parquet, we create a Parquet file containing just the schema of the DataFrame.
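For instance, Case 1 can be completed as below in spark-shell (the output path is illustrative; any writable location works):

``` scala
// Writing an empty DataFrame that still has one partition: the write task
// runs once and emits a metadata-only Parquet file alongside _SUCCESS.
scala> List.empty[String].toDF().write.parquet("/tmp/empty_df")

// The schema survives the round trip even though there are no rows.
scala> spark.read.parquet("/tmp/empty_df").printSchema()
```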
``` scala
// Case 2
scala> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
anySchema: org.apache.spark.sql.types.StructType = StructType(StructField(anyName,StringType,false))

scala> spark.read.schema(anySchema).csv("/tmp/empty_folder").rdd.partitions.length
res22: Int = 0
```
In the second case, since the number of partitions is 0, the write task is never invoked (that task contains the logic that creates the metadata-only Parquet file), so only _SUCCESS is written.
The fix repartitions the empty RDD to a single partition before setting up the write job, so the write task runs and emits the metadata-only file.
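A minimal sketch of the idea (not the actual patch; the real change sits in Spark's write path, and the names below are illustrative):

``` scala
// Sketch only: ensure the RDD backing the write has at least one
// partition, so a write task runs and creates the metadata-only file.
val rdd = queryExecution.toRdd  // hypothetical handle to the physical RDD
val rddWithNonEmptyPartitions =
  if (rdd.partitions.isEmpty) rdd.repartition(1) else rdd
// ... set up the write job over rddWithNonEmptyPartitions as before ...
```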
## How was this patch tested?
A new test is added to DataFrameReaderWriterSuite.
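The test itself is not quoted in this message; a test along these lines (names and structure hypothetical, the PR's actual test may differ) would cover the fix:

``` scala
// Hypothetical sketch of a regression test for DataFrameReaderWriterSuite.
test("SPARK-23271: writing a 0-partition DataFrame still produces a metadata file") {
  withTempDir { dir =>
    val emptyDir = new File(dir, "empty")
    emptyDir.mkdirs()  // an empty source folder yields an RDD with 0 partitions
    val schema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
    val df = spark.read.schema(schema).csv(emptyDir.getCanonicalPath)
    val outPath = new File(dir, "out").getCanonicalPath
    df.write.parquet(outPath)
    // The schema should be readable back, with zero rows.
    val readBack = spark.read.parquet(outPath)
    assert(readBack.schema === schema)
    assert(readBack.count() === 0)
  }
}
```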
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark spark-23271
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20525.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20525
----
commit 2764b1c0aa43a104393da909f388861209220d4f
Author: Dilip Biswal <dbiswal@...>
Date: 2018-02-07T07:45:32Z
SPARK-23271 Parquet output contains only _SUCCESS file after writing an
empty dataframe
----
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]