GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/20648

    [SPARK-23448][SQL] JSON parser should return partial row when some 
columns fail to parse under PermissiveMode

    ## What changes were proposed in this pull request?
    
    When we read a JSON document with a corrupted field under `PermissiveMode`:
    ```json
    {"attr1":"val1","attr2":"[\"val2\"]"}
    {"attr1":"val1","attr2":["val2"]}
    ```
    
    ```scala
    val schema = StructType(
      Seq(StructField("attr1", StringType, true),
          StructField("attr2", ArrayType(StringType, true), true)))
    
    spark.read.schema(schema).json(input).collect().foreach(println)
    ```
    
    Currently we get these results:
    ```
    [null,null]
    [val1,WrappedArray(val2)]
    ```
    
    From `FailureSafeParser` and `BadRecordException`, the intention seems 
to be to return a partial result for a corrupted record. But the current 
implementation doesn't actually return a partial result at all: as the example 
above shows, all columns are null. This patch fills the gap and returns the 
partial result.
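    The intended partial-result behavior can be sketched outside Spark. The 
following is a minimal standalone Python illustration, not Spark's actual 
`FailureSafeParser`; `SCHEMA` and `parse_permissive` are hypothetical names 
introduced here, and Python's built-in types stand in for Spark's `StringType` 
and `ArrayType`:
    
    ```python
    import json
    
    # Assumed schema mirroring the example: attr1 is a string, attr2 an array.
    SCHEMA = {"attr1": str, "attr2": list}
    
    def parse_permissive(line):
        """Parse one JSON line; on a per-field type mismatch, null out only
        that field instead of the whole row (the "partial result")."""
        record = json.loads(line)
        row = []
        for name, expected_type in SCHEMA.items():
            value = record.get(name)
            row.append(value if isinstance(value, expected_type) else None)
        return row
    
    lines = [
        '{"attr1":"val1","attr2":"[\\"val2\\"]"}',  # attr2 is a string, not an array
        '{"attr1":"val1","attr2":["val2"]}',        # well-formed record
    ]
    for line in lines:
        print(parse_permissive(line))
    # The corrupted record keeps attr1 and nulls only attr2:
    # ['val1', None]
    # ['val1', ['val2']]
    ```
    
    The key point of the patch is this per-column granularity: a parse failure 
in one column should not discard the columns that did parse successfully.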
    
    ## How was this patch tested?
    
    Added tests pass.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23448

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20648.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20648
    
----
commit 3d7d0415f2bfc2274fe94636b222d1ee437b0d24
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-02-20T14:03:49Z

    Returns partial row when part of columns are failed to parse.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org