GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/20648
[SPARK-23448][SQL] JSON parser should return partial row when part of
columns are failed to parse under PermissiveMode
## What changes were proposed in this pull request?
When we read a JSON document with a corrupted field under `PermissiveMode`:
```json
{"attr1":"val1","attr2":"[\"val2\"]"}
{"attr1":"val1","attr2":["val2"]}
```
```scala
val schema = StructType(
Seq(StructField("attr1", StringType, true),
StructField("attr2", ArrayType(StringType, true), true)))
spark.read.schema(schema).json(input).collect().foreach(println)
```
Currently we get these results:
```
[null,null]
[val1,WrappedArray(val2)]
```
Judging from `FailureSafeParser` and `BadRecordException`, there seems to be an
intention to return a partial result for a corrupted record. But the current
implementation doesn't actually return a partial result at all: as the example
above shows, all columns come back null. This patch fills that gap and returns
a partial result.
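The idea can be illustrated with a minimal sketch. This is hypothetical code, not Spark's actual `JacksonParser`/`FailureSafeParser` implementation: each field is parsed independently, and only the fields that fail to parse become null, rather than the entire row. The `FieldParser` type and the simple "must look like a JSON array" check for `attr2` are assumptions for illustration.

```scala
// Hypothetical sketch of partial-row parsing (not Spark's real implementation).
object PartialRowSketch {
  // A per-field parser: Some(value) on success, None on parse failure.
  type FieldParser = String => Option[Any]

  // Field parsers matching the example schema: attr1 is a plain string,
  // attr2 must look like a JSON array (a crude stand-in for real parsing).
  val parsers: Map[String, FieldParser] = Map(
    "attr1" -> ((s: String) => Some(s)),
    "attr2" -> ((s: String) => if (s.startsWith("[")) Some(s) else None)
  )

  // Parse each field independently; a failed (or missing) field becomes
  // null instead of discarding the whole row.
  def parsePartial(raw: Map[String, String]): Map[String, Any] =
    parsers.map { case (name, parse) =>
      name -> raw.get(name).flatMap(parse).orNull
    }
}
```

With this sketch, a record whose `attr2` is a bare string instead of an array keeps its parsed `attr1` value and gets null only for `attr2`, matching the behavior this patch aims for.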
## How was this patch tested?
Added tests pass.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-23448
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20648.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20648
----
commit 3d7d0415f2bfc2274fe94636b222d1ee437b0d24
Author: Liang-Chi Hsieh <viirya@...>
Date: 2018-02-20T14:03:49Z
Returns partial row when part of columns are failed to parse.
----