Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23141#discussion_r236200834
  
    --- Diff: docs/sql-migration-guide-upgrade.md ---
    @@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide
     
       - Since Spark 3.0, the `from_json` function supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially in the processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions, but Spark 3.0 converts it to `Row(null)`.
     
    -  - In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings and the JSON datasource skips the same independently of its mode if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if the specified schema is `key STRING, value INT`. 
    +  - In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings and the JSON datasource skips the same independently of its mode if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.
     
       - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
     
       - In Spark version 2.4 and earlier, users can create map values with map type keys via built-in functions like `CreateMap`, `MapFromArrays`, etc. Since Spark 3.0, creating map values with map type keys via these built-in functions is not allowed. Users can still read map values with map type keys from data sources or Java/Scala collections, though they are not very useful.
    -  
    +
       - In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a grouped dataset with the key attribute wrongly named "value" if the key is of a non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.
     
    +  - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect`, etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can no longer distinguish them.
    --- End diff ---
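
    The new note above can be illustrated with a minimal spark-shell sketch (the column name `d` and the commented behavior are illustrative assumptions based on the note, not output taken from this PR):

    ```
    // Tiny DataFrame containing a negative zero (spark-shell already provides
    // spark.implicits._, which supplies toDF).
    val df = Seq(-0.0d, 0.0d).toDF("d")

    // Spark 2.4 and earlier: the sign is kept, so show()/collect() can print -0.0.
    // Spark 3.0, per the note above: -0.0 is replaced by 0.0 internally, so the
    // two values are no longer distinguishable here.
    df.show()
    df.collect().foreach(println)
    ```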
    
    I ran a few simple queries on Hive 2.1.
    
    Simple comparisons seem OK:
    
    ```
    hive> select 1 where 0.0=-0.0;
    OK
    1
    Time taken: 0.047 seconds, Fetched: 1 row(s)
    hive> select 1 where -0.0<0.0;
    OK
    Time taken: 0.053 seconds
    ```
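
    For comparison, the equality check can be sketched on the Spark side with the Dataset API (a sketch only; the column name `d` is illustrative, not from the PR):

    ```
    // -0.0 and 0.0 compare as equal for doubles, so both rows should pass the filter.
    val matched = Seq(0.0d, -0.0d).toDF("d").where($"d" === 0.0d).count()
    println(matched)  // expected: 2
    ```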
    
    But the group-by behavior in Hive does not look correct:
    ```
    hive> select * from test;
    OK
    0.0
    -0.0
    0.0
    Time taken: 0.11 seconds, Fetched: 3 row(s)
    hive> select a, count(*) from test group by a;
    -0.0        3
    Time taken: 1.308 seconds, Fetched: 1 row(s)
    ```
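
    The analogous grouping check in spark-shell, as a sketch (the data and column name `a` mirror the Hive example above; the expected result follows the migration note and is not re-verified here):

    ```
    // Same data as the Hive table: 0.0, -0.0, 0.0 in a single double column.
    val df = Seq(0.0d, -0.0d, 0.0d).toDF("a")

    // Per the note, Spark 3.0 should report a single group (0.0, 3),
    // whereas Hive 2.1 above shows the group key as -0.0.
    df.groupBy("a").count().show()
    ```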
    


