GitHub user viirya commented on the issue:
https://github.com/apache/spark/pull/18865
@HyukjinKwon's provided use case looks pretty fair. There the corrupt record is
the whole line, which doesn't follow the JSON format. That is somewhat different
from the corrupt-record case where some JSON fields can't be correctly converted
to the desired data type.
These two kinds of corrupt records can be mixed in one JSON file, e.g.:
echo '{"field": 1
{"field" 2}
{"field": 3}
{"field": "4"}' >/tmp/sample.json
scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|null |{"field": 1    |
|null | {"field" 2}   |
|3    |null           |
|null |{"field": "4"} |
+-----+---------------+
scala> dfFromFile.select($"_corrupt_record").show()
+---------------+
|_corrupt_record|
+---------------+
|    {"field": 1|
|    {"field" 2}|
|           null|
|           null|
+---------------+
Note that in the second query `{"field": "4"}` is no longer reported as corrupt,
because the `field` column isn't actually parsed when only `_corrupt_record` is
selected.
At least we should clearly explain the difference in the error message. Maybe
something like: the query now in effect requires only `_corrupt_record` after
optimization. When corrupt records are caused by JSON field conversion errors,
they might not be generated correctly, because no other JSON fields are actually
parsed. To obtain the most accurate result, we recommend users cache or save the
dataset before running queries that reference only `_corrupt_record`.
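For instance, the recommended workaround could look like this (just a sketch,
reusing the `dfFromFile` above; building the cache parses all JSON fields, so
the conversion-error corrupt records are kept):
scala> // cache the fully parsed result, then query only _corrupt_record
scala> val cached = dfFromFile.cache()
scala> cached.select($"_corrupt_record").show()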