Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/18865#discussion_r137917938
--- Diff: docs/sql-programming-guide.md ---
@@ -1542,6 +1542,10 @@ options.
# Migration Guide
+## Upgrading From Spark SQL 2.2 to 2.3
+
+ - The queries which select only `spark.sql.columnNameOfCorruptRecord`
column are disallowed now. Notice that the queries which have only the column
after column pruning (e.g. filtering on the column followed by a counting
operation) are also disallowed. If you want to select only the corrupt records,
you should cache or save the underlying Dataset and DataFrame before running
such queries.
--- End diff ---
> Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when
the referenced columns only include the internal corrupt record column (named
`_corrupt_record` by default). For example,
`spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()`
and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`.
Instead, you can cache or save the parsed results and then run the same query.
For example, `val df = spark.read.schema(schema).json(file).cache()` and then
`df.filter($"_corrupt_record".isNotNull).count()`.
---