GitHub user jmchung opened a pull request:
https://github.com/apache/spark/pull/18865
[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file
## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```
```scala
import org.apache.spark.sql.types._
val schema = new StructType()
.add("field", ByteType)
.add("_corrupt_record", StringType)
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema(schema).json(file)
scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1 |null |
|2 |null |
|null |{"field": "3"} |
+-----+---------------+
scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0
scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` contains only `_corrupt_record`, the derived
`actualSchema` is empty, so the parser has nothing to convert and
`_corrupt_record` ends up null for every row. When users require only
`_corrupt_record`, we assume the corrupt records should be detected against
all JSON fields in the data schema (see the sketch below).
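The following minimal sketch, which borrows the names used in
`JsonFileFormat` (`requiredSchema`, `dataSchema`, `columnNameOfCorruptRecord`)
but is illustrative rather than the actual patch, shows why the pruned schema
ends up empty and how a fallback to the full data schema would behave:
```scala
import org.apache.spark.sql.types._

// Illustrative only: names mirror JsonFileFormat, not the exact patch.
val columnNameOfCorruptRecord = "_corrupt_record"

val dataSchema = new StructType()
  .add("field", ByteType)
  .add(columnNameOfCorruptRecord, StringType)

// After column pruning, the count() query above only requires the corrupt-record column.
val requiredSchema = new StructType()
  .add(columnNameOfCorruptRecord, StringType)

// Current behaviour: dropping the corrupt-record column leaves nothing to parse,
// so the parser never hits a malformed field and _corrupt_record stays null.
val actualSchema =
  StructType(requiredSchema.filterNot(_.name == columnNameOfCorruptRecord))
// actualSchema is empty: StructType()

// Proposed behaviour (sketch): when only _corrupt_record is required, parse against
// the remaining fields of the full data schema so malformed rows are still detected.
val parseSchema = if (actualSchema.isEmpty) {
  StructType(dataSchema.filterNot(_.name == columnNameOfCorruptRecord))
} else {
  actualSchema
}
// parseSchema: StructType(StructField(field,ByteType,true))
```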
## How was this patch tested?
Added a test case in `JsonSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jmchung/spark SPARK-21610
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18865.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18865
----
commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T03:52:45Z
[SPARK-21610][SQL] Corrupt records are not handled properly when creating a
dataframe from a file
commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T04:39:36Z
add explanation to schema change and minor refactor in test case
commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T04:59:07Z
move test case from DataFrameReaderWriterSuite to JsonSuite
commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T05:41:15Z
filter not _corrupt_record in dataSchema
commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T06:17:48Z
code refactor
----