GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/13546

    [SPARK-15808] [SQL] File Format Checking When Appending Data

    #### What changes were proposed in this pull request?
    **Issue:** We get wrong results or strange errors when appending data to a table 
with a mismatched file format. 
    
    _Example 1: Parquet -> ORC_
    ```Scala
    createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
    createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
    ```
    
    Error we got: 
    ```
    Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-00000-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23]
    ```
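    For reference, the magic numbers quoted in this error are plain ASCII: the expected tail bytes `[80, 65, 82, 49]` spell `PAR1` (the Parquet footer magic), while `[79, 82, 67]` spell `ORC`. A quick decode:
    ```Scala
    // Decode the magic bytes from the error message above.
    val expectedMagic = Array(80, 65, 82, 49).map(_.toChar).mkString  // "PAR1"
    val foundMagic    = Array(79, 82, 67).map(_.toChar).mkString      // "ORC"
    println(s"expected: $expectedMagic, found: $foundMagic")
    ```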
    
    _Example 2: JSON -> Parquet_
    ```Scala
    createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
    createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
    ```
    
    No exception, but wrong results:
    ```
    +----+----+
    |  c1|  c2|
    +----+----+
    |null|null|
    |null|null|
    |null|null|
    |null|null|
    |   0|str0|
    |   1|str1|
    |   2|str2|
    |   3|str3|
    |   4|str4|
    |   5|str5|
    |   6|str6|
    |   7|str7|
    |   8|str8|
    |   9|str9|
    +----+----+
    ```
    _Example 3: JSON -> Text_
    ```Scala
    createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
    createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
    ```
    
    Error we got: 
    ```
    Text data source supports only a single column, and you have 2 columns.
    ```
    
    This PR makes the write path throw an exception with an appropriate error message when the format specified for the append does not match the file format of the existing table.
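    
    A minimal sketch of the kind of check this PR adds. All names here are illustrative, not the actual Spark internals:
    ```Scala
    // Hypothetical sketch: reject an append whose specified format differs
    // from the format of the existing table, instead of silently writing
    // mismatched files.
    def checkAppendFormat(existingProvider: String, specifiedProvider: String): Unit = {
      if (!existingProvider.equalsIgnoreCase(specifiedProvider)) {
        throw new IllegalArgumentException(
          s"The file format of the existing table ($existingProvider) is different " +
          s"from the format you are trying to append ($specifiedProvider).")
      }
    }
    ```
    With this in place, `checkAppendFormat("parquet", "orc")` fails fast with a clear message rather than producing the runtime error or silent nulls shown above.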
    
    #### How was this patch tested?
    Added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark fileFormatCheck

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13546.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13546
    
----
commit 9f70a7de7387a4b80913b92c308061620eec2a45
Author: gatorsmile <[email protected]>
Date:   2016-06-07T18:35:36Z

    file format checking

commit 74ac6d956a80330fa0a5d8d62b5f3569b4179321
Author: gatorsmile <[email protected]>
Date:   2016-06-07T18:44:02Z

    update the test cases.

commit 9d9d2632fde85c62b28454f33b24f7ee8fb6f15e
Author: gatorsmile <[email protected]>
Date:   2016-06-07T19:02:36Z

    update the test cases.

----

