Github user dm-tran commented on a diff in the pull request:
https://github.com/apache/spark/pull/18865#discussion_r136047628
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with
DataSourceRegister {
}
(file: PartitionedFile) => {
- val parser = new JacksonParser(actualSchema, parsedOptions)
+ // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
--- End diff ---
> We need to let users know that _corrupted_record is a derived column from other columns and cannot be selected alone in a query.
@viirya I created issue https://issues.apache.org/jira/browse/SPARK-21610 and need to select the field "_corrupt_record" alone. This is possible with Spark 2.2 (if a DataFrame is created from an RDD), and it would be great to keep this behaviour in future versions of Spark.
My use case is the following: a Spark job reads JSON with an input schema and will:
- save records that match the input schema in Parquet format
- save "corrupt records" (invalid JSON, or records that do not match the input schema) to text files in a separate folder.
Basically, I want:
- a folder with clean data in Parquet format
- another folder with "corrupt records". I can then analyze the corrupt records and, for instance, tell partners that they are sending invalid data. This enables a clean data pipeline that separates valid records from corrupt records.
To get valid and corrupt records, I write:
```scala
val validRecords = df.filter(col("_corrupt_record").isNull)
  .drop("_corrupt_record")
val corruptRecords = df.filter(col("_corrupt_record").isNotNull)
  .select("_corrupt_record")
```
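For context, the whole job might be sketched roughly as below. This is only an outline of the use case described above, not code from the PR; the paths, the example schema fields, and the SparkSession setup are all assumptions. The corrupt-record column name must match `spark.sql.columnNameOfCorruptRecord`, which defaults to `_corrupt_record`.

```scala
// Hypothetical sketch of the described pipeline; paths and schema are made up.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("json-split").getOrCreate()

// Input schema, plus the corrupt-record column so malformed rows are kept.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read.schema(schema).json("/data/in")

// Caching materializes the parsed result, so both filters below reuse it;
// it also avoids re-parsing the raw files when only `_corrupt_record`
// would otherwise be referenced.
df.cache()

df.filter(col("_corrupt_record").isNull)
  .drop("_corrupt_record")
  .write.parquet("/data/out/clean")

df.filter(col("_corrupt_record").isNotNull)
  .select("_corrupt_record")
  .write.text("/data/out/corrupt")
```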
---