maropu commented on a change in pull request #27478: [SPARK-25829][SQL] Add
config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the
default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376262398
##########
File path: docs/sql-migration-guide.md
##########
@@ -49,7 +49,7 @@ license: |
- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal
to 0.0, but -0.0 and 0.0 are considered as different values when used in
aggregate grouping keys, window partition keys and join keys. Since Spark 3.0,
this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()`
returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and
earlier.
- - In Spark version 2.4 and earlier, users can create a map with duplicated
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior
of a map with duplicated keys is undefined, e.g. map lookup respects the
duplicated key that appears first, `Dataset.collect` only keeps the duplicated
key that appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0,
these built-in functions will remove duplicated map keys with the last-wins
policy. Users may still read map values with duplicated keys from data sources
which do not enforce it (e.g. Parquet); in that case the behavior is undefined.
+ - In Spark version 2.4 and earlier, users can create a map with duplicated
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior
of a map with duplicated keys is undefined, e.g. map lookup respects the
duplicated key that appears first, `Dataset.collect` only keeps the duplicated
key that appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0,
the new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` has been
added with the default value `false`, and Spark will throw a RuntimeException
when duplicated keys are found. If it is set to `true`, these built-in
functions will remove duplicated map keys with the last-wins policy. Users may
still read map values with duplicated keys from data sources which do not
enforce it (e.g. Parquet); in that case the behavior is undefined.
Review comment:
I agree with the idea of avoiding "silent result changes". Btw, couldn't we
keep the old (Spark 2.4) behaviour for duplicate keys by using a legacy option?
If we can't for some reason, the proposed one (the runtime exception) looks
reasonable to me, too. A small sketch of the proposed behaviour is below.