maropu commented on a change in pull request #27478: [SPARK-25829][SQL] Add 
config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the 
default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376262398
 
 

 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -49,7 +49,7 @@ license: |
 
  - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a map with duplicated keys is undefined, e.g. map lookup respects the first occurrence of a duplicated key, `Dataset.collect` only keeps the last occurrence, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with a last-wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet); in that case the behavior is undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a map with duplicated keys is undefined, e.g. map lookup respects the first occurrence of a duplicated key, `Dataset.collect` only keeps the last occurrence, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, the new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with a default value of `false`; by default, Spark throws a RuntimeException when duplicated keys are found. If it is set to `true`, these built-in functions remove duplicated map keys with a last-wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet); in that case the behavior is undefined.
 
 Review comment:
   I agree with the idea of avoiding a "silent result change". Btw, couldn't we keep the old (Spark 2.4) behaviour for duplicate keys via a legacy option? If we can't for some reason, the proposed approach (the runtime exception) looks reasonable to me, too.
