GitHub user dongjoon-hyun opened a pull request:
https://github.com/apache/spark/pull/22378
[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
## What changes were proposed in this pull request?
Like the `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields, because Spark cannot read those files back.
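As a quick illustration (not part of this patch): the query used in the examples below ends up with two fields named `id`, because Spark names a string-literal column after its value and the second column is explicitly aliased to `id`.

```scala
// Sketch only: inspect the output schema of the example query.
// Both fields are expected to be named `id` under Spark's default
// naming of string literals.
scala> spark.sql("SELECT 'id', 'id2' id").schema.fieldNames
// expected: Array(id, id)
```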
**INSERT OVERWRITE DIRECTORY USING**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet
SELECT 'id', 'id2' id")
... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`;
```
**INSERT OVERWRITE DIRECTORY STORED AS**
```scala
scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS
parquet SELECT 'id', 'id2' id")
// It succeeds, but generates corrupted files that cannot be read back.
scala> spark.read.parquet("/tmp/parquet").show
18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`;
```
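For reference, the check itself fits in a few lines of Scala. The sketch below is purely illustrative and is not the patch's code: the helper name `assertNoDuplicateFields` is hypothetical, and the actual fix presumably reuses Spark's internal schema-duplication utilities in the `STORED AS` write path rather than a standalone function.

```scala
import org.apache.spark.sql.DataFrame

// Illustrative only: reject a write when the query's output schema
// contains duplicate field names. Names are compared case-insensitively,
// matching Spark's default spark.sql.caseSensitive=false behavior.
def assertNoDuplicateFields(df: DataFrame): Unit = {
  val names = df.schema.fieldNames.map(_.toLowerCase)
  val dups = names.groupBy(identity).collect { case (n, vs) if vs.length > 1 => n }
  require(dups.isEmpty,
    s"Found duplicate column(s): ${dups.map(n => s"`$n`").mkString(", ")}")
}

// Usage: assertNoDuplicateFields(spark.sql("SELECT 'id', 'id2' id"))
// fails before any files are written, which is the behavior the
// STORED AS path should share with the USING path.
```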
## How was this patch tested?
Pass Jenkins with the newly added test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dongjoon-hyun/spark SPARK-25389
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22378.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22378
----
commit 02425765d243bdd4aeaccd71851b0457be108690
Author: Dongjoon Hyun <dongjoon@...>
Date: 2018-09-10T05:24:33Z
[SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
----