GitHub user dongjoon-hyun opened a pull request:
https://github.com/apache/spark/pull/22378
[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
## What changes were proposed in this pull request?
Like the `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields, because Spark cannot read those files back.
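As a quick illustration (not part of this patch): the query used in the examples below ends up with two fields named `id`, because Spark names a string-literal column after its value and the second column is explicitly aliased to `id`.

```scala
// Sketch only: inspect the output schema of the example query.
// Both fields are expected to be named `id` under Spark's default
// naming of string literals.
scala> spark.sql("SELECT 'id', 'id2' id").schema.fieldNames
// expected: Array(id, id)
```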
**INSERT OVERWRITE DIRECTORY USING**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet
SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
```
**INSERT OVERWRITE DIRECTORY STORED AS**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS
parquet SELECT 'id', 'id2' id")
// It succeeds, but generates corrupted files that cannot be read back.
scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
```
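For reference, the check itself fits in a few lines of Scala. The sketch below is purely illustrative and is not the patch's code: the helper name `assertNoDuplicateFields` is hypothetical, and the actual fix presumably reuses Spark's internal schema-duplication utilities in the `STORED AS` write path rather than a standalone function.

```scala
import org.apache.spark.sql.DataFrame

// Illustrative only: reject a write when the query's output schema
// contains duplicate field names. Names are compared case-insensitively,
// matching Spark's default spark.sql.caseSensitive=false behavior.
def assertNoDuplicateFields(df: DataFrame): Unit = {
  val names = df.schema.fieldNames.map(_.toLowerCase)
  val dups = names.groupBy(identity).collect { case (n, vs) if vs.length > 1 => n }
  require(dups.isEmpty,
    s"Found duplicate column(s): ${dups.map(n => s"`$n`").mkString(", ")}")
}

// Usage: assertNoDuplicateFields(spark.sql("SELECT 'id', 'id2' id"))
// fails before any files are written, which is the behavior the
// STORED AS path should share with the USING path.
```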
## How was this patch tested?
Pass Jenkins with the newly added test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dongjoon-hyun/spark SPARK-25389
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22378.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22378
----
commit 02425765d243bdd4aeaccd71851b0457be108690
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-09-10T05:24:33Z
[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
----