GitHub user jmchung opened a pull request:
https://github.com/apache/spark/pull/18865
[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file
## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```
```scala
import org.apache.spark.sql.types._
val schema = new StructType()
.add("field", ByteType)
.add("_corrupt_record", StringType)
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema(schema).json(file)
scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1 |null |
|2 |null |
|null |{"field": "3"} |
+-----+---------------+
scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0
scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` contains only `_corrupt_record`, the derived
`actualSchema` is empty, so the parser has nothing to convert and
`_corrupt_record` ends up null for every row. When users require only
`_corrupt_record`, we assume the corrupt records should be detected against
all JSON fields in the data schema (see the sketch below).
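The following minimal sketch, which borrows the names used in
`JsonFileFormat` (`requiredSchema`, `dataSchema`, `columnNameOfCorruptRecord`)
but is illustrative rather than the actual patch, shows why the pruned schema
ends up empty and how a fallback to the full data schema would behave:
```scala
import org.apache.spark.sql.types._

// Illustrative only: names mirror JsonFileFormat, not the exact patch.
val columnNameOfCorruptRecord = "_corrupt_record"

val dataSchema = new StructType()
  .add("field", ByteType)
  .add(columnNameOfCorruptRecord, StringType)

// After column pruning, the count() query above only requires the corrupt-record column.
val requiredSchema = new StructType()
  .add(columnNameOfCorruptRecord, StringType)

// Current behaviour: dropping the corrupt-record column leaves nothing to parse,
// so the parser never hits a malformed field and _corrupt_record stays null.
val actualSchema =
  StructType(requiredSchema.filterNot(_.name == columnNameOfCorruptRecord))
// actualSchema is empty: StructType()

// Proposed behaviour (sketch): when only _corrupt_record is required, parse against
// the remaining fields of the full data schema so malformed rows are still detected.
val parseSchema = if (actualSchema.isEmpty) {
  StructType(dataSchema.filterNot(_.name == columnNameOfCorruptRecord))
} else {
  actualSchema
}
// parseSchema: StructType(StructField(field,ByteType,true))
```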
## How was this patch tested?
Added a test case in `JsonSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jmchung/spark SPARK-21610
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18865.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18865
----
commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T03:52:45Z
[SPARK-21610][SQL] Corrupt records are not handled properly when creating a
dataframe from a file
commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T04:39:36Z
add explanation to schema change and minor refactor in test case
commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T04:59:07Z
move test case from DataFrameReaderWriterSuite to JsonSuite
commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T05:41:15Z
filter not _corrupt_record in dataSchema
commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung <[email protected]>
Date: 2017-08-07T06:17:48Z
code refactor
----