cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add
config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the
default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r378054659
##########
File path: docs/sql-migration-guide.md
##########
@@ -49,7 +49,7 @@ license: |
- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal
to 0.0, but -0.0 and 0.0 are considered as different values when used in
aggregate grouping keys, window partition keys and join keys. Since Spark 3.0,
this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()`
returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and
earlier.
- - In Spark version 2.4 and earlier, users can create a map with duplicated
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior
of map with duplicated keys is undefined, e.g. map look up respects the
duplicated key appears first, `Dataset.collect` only keeps the duplicated key
appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these
built-in functions will remove duplicated map keys with last wins policy. Users
may still read map values with duplicated keys from data sources which do not
enforce it (e.g. Parquet), the behavior will be undefined.
 + - In Spark version 2.4 and earlier, users can create a map with duplicated
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior
of a map with duplicated keys is undefined, e.g. map lookup respects the
duplicated key that appears first, `Dataset.collect` only keeps the duplicated
key that appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0,
a new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added,
with the default value `false`; Spark will throw a RuntimeException when
duplicated keys are found. If it is set to `true`, these built-in functions
will remove duplicated map keys with a last-wins policy. Users may still read
map values with duplicated keys from data sources which do not enforce it
(e.g. Parquet); in that case the behavior will be undefined.
Review comment:
What I have in mind is:
- if it's a bug fix (the previous result is definitely wrong), then no
config is needed. If the impact is big, we can add a legacy config which is
false by default.
 - if it makes the behavior better, we should either add a config and use the
old behavior by default, or fail by default and ask users to set the config
explicitly to pick the desired behavior.
I'm trying to think of more cases; I will send an email to the dev list soon.
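The two policies under discussion (fail on duplicates by default vs. last-wins deduplication when the config is enabled) can be sketched in plain Python. This is a hypothetical helper for illustration only, not Spark's actual implementation; the function name, signature, and error message are assumptions:

```python
def dedup_map_keys(pairs, last_wins=False):
    """Build a map from key/value pairs, handling duplicated keys.

    Mirrors the two behaviors discussed above (illustrative sketch,
    not Spark code):
    - last_wins=False (the proposed default): raise on a duplicated key,
    - last_wins=True: keep the value of the last occurrence of each key.
    """
    result = {}
    for key, value in pairs:
        if not last_wins and key in result:
            # Proposed default: duplicated keys are an error.
            raise RuntimeError(f"Duplicate map key {key!r} was found")
        # With last_wins=True, a later occurrence overwrites the earlier value.
        result[key] = value
    return result

# Last-wins policy: map(a -> 1, a -> 2) becomes {a -> 2}.
print(dedup_map_keys([("a", 1), ("a", 2)], last_wins=True))  # → {'a': 2}
```

With `last_wins=False`, the same input raises a `RuntimeError` instead, which corresponds to the `false` default of the proposed config.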