Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19389#discussion_r152294412
--- Diff: docs/sql-programming-guide.md ---
@@ -1577,6 +1577,143 @@ options.
- Since Spark 2.3, queries from raw JSON/CSV files are disallowed
when the referenced columns only include the internal corrupt record column
(named `_corrupt_record` by default). For example,
`spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()`
and `spark.read.schema(schema).json(file).select("_corrupt_record").show()` are
both rejected. Instead, you can cache or save the parsed results and then run
the same query. For example, `val df = spark.read.schema(schema).json(file).cache()`
and then `df.filter($"_corrupt_record".isNotNull).count()`.
- The `percentile_approx` function previously accepted only numeric input
and produced results of double type. Now it also supports date type and
timestamp type as input, in addition to numeric types. The result type is also
changed to match the input type, which is more reasonable for percentiles.
+ - Partition column inference previously found an incorrect common type for
different inferred types; for example, it previously ended up with double type
as the common type for double type and date type. Now it finds the correct
common type for such conflicts. The conflict resolution follows the table below:
--- End diff ---
Built doc shows as below:
<img width="1119" alt="2017-11-21 11 40 50"
src="https://user-images.githubusercontent.com/6477701/33078316-9cf4e24a-cf15-11e7-9e40-41b98e7f9358.png">