[
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464245#comment-16464245
]
Bruce Robbins commented on SPARK-23936:
---------------------------------------
[~ueshin]
I have a question about map_concat's behavior as it pertains to this part of
the function description: "If a key is found in multiple given maps, that key’s
value in the resulting map comes from the last one of those maps."
Spark maps can have duplicate keys, e.g.:
{noformat}
scala> val df = sql("select map('a', 1, 'a', 2, 'b', 3, 'c', 10) as map1, map('a', 7, 'b', 8, 'b', 9) as map2")
scala> df.show(truncate=false)
+---------------------------------+------------------------+
|map1                             |map2                    |
+---------------------------------+------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10]|[a -> 7, b -> 8, b -> 9]|
+---------------------------------+------------------------+
{noformat}
I'm not sure the duplicate handling part of the description makes sense for
maps that allow duplicate keys.
I can think of 3 ways to handle the duplicate-key requirement:
Scheme #1: Ignore it. map_concat would be a pure concatenation. Using the
above example maps:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------------------------------+
|map_concat(map1, map2)                                   |
+---------------------------------------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9]|
+---------------------------------------------------------+
{noformat}
Duplicate keys are preserved from the original maps, and, in this example,
additional duplicates are introduced.
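A minimal sketch of Scheme #1 in plain Scala, outside Spark. It assumes a map with duplicate keys can be modelled as an ordered sequence of key/value pairs. The names Entries, Scheme1 and concat are hypothetical and are not part of Spark's API:

```scala
object Scheme1 {
  // Assumption: a map that may hold duplicate keys is modelled as an ordered list of entries.
  type Entries = Seq[(String, Int)]

  val map1: Entries = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10)
  val map2: Entries = Seq("a" -> 7, "b" -> 8, "b" -> 9)

  // Scheme #1: append the entries of every map in order, with no deduplication.
  def concat(maps: Entries*): Entries = maps.flatten
}
```

Calling Scheme1.concat(Scheme1.map1, Scheme1.map2) keeps all seven entries, in the same order as the output above.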
Scheme #2: Preserve duplicates within input maps, but still pick a winner
across maps. That is, treat the maps like so:
{noformat}
map1:
a -> [1, 2]
b -> [3]
c -> [10]
map2:
a -> [7]
b -> [8, 9]
{noformat}
Then use the rule that the key's value comes from the last map in which the key
appears:
{noformat}
resulting map
a -> [7] // from map2
b -> [8, 9] // from map2
c -> [10] // from map1
{noformat}
In Spark, it would look like this:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------+
|map_concat(map1, map2)           |
+---------------------------------+
|[a -> 7, b -> 8, b -> 9, c -> 10]|
+---------------------------------+
{noformat}
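Scheme #2 could be sketched in plain Scala roughly as follows, again modelling a map as an ordered sequence of pairs. The names Entries, Scheme2 and concat are hypothetical, and emitting keys in order of first appearance is an assumption chosen only to match the output above:

```scala
object Scheme2 {
  // Assumption: a map that may hold duplicate keys is modelled as an ordered list of entries.
  type Entries = Seq[(String, Int)]

  val map1: Entries = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10)
  val map2: Entries = Seq("a" -> 7, "b" -> 8, "b" -> 9)

  // Scheme #2: for each key, keep every entry for that key from the last map
  // in which the key appears; duplicates inside that winning map are preserved.
  def concat(maps: Entries*): Entries = {
    // Distinct keys, in order of first appearance across all input maps.
    val keysInOrder = maps.flatten.map(_._1).distinct
    keysInOrder.flatMap { key =>
      // The last input map that contains this key wins.
      val winner = maps.reverse.find(_.exists(_._1 == key)).getOrElse(Seq.empty)
      winner.filter(_._1 == key)
    }
  }
}
```

With the example maps, the entries for a and b come from map2 (both b entries survive) and the entry for c comes from map1.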
Scheme #3: Don't allow any duplicates in the resulting map. That is, treat the
input maps collectively as a stream of tuples, and keep only the last value for
_any_ key:
{noformat}
a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9
          ^                        ^       ^       ^
          |                        |       |       |
     overwrites               overwrites   |  overwrites
       a -> 1                   a -> 2     |    b -> 8
                                      overwrites
                                        b -> 3
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+-------------------------+
|map_concat(map1, map2)   |
+-------------------------+
|[a -> 7, b -> 9, c -> 10]|
+-------------------------+
{noformat}
Note: This is what I've actually implemented in my PR. It made sense to me due
to the requirement that we pick a winner across maps. But I wasn't aware then
that the source maps could have duplicates.
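For illustration only, the Scheme #3 rule can be sketched in plain Scala as below. This is not the code in the PR. It assumes the same ordered-pairs model as the examples above, and the names Entries, Scheme3 and concat are hypothetical:

```scala
import scala.collection.mutable

object Scheme3 {
  // Assumption: a map that may hold duplicate keys is modelled as an ordered list of entries.
  type Entries = Seq[(String, Int)]

  val map1: Entries = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10)
  val map2: Entries = Seq("a" -> 7, "b" -> 8, "b" -> 9)

  // Scheme #3: treat all maps as one stream of entries and keep only the
  // last value seen for each key, so the result has no duplicate keys.
  def concat(maps: Entries*): Entries = {
    // LinkedHashMap keeps a key at the position where it was first inserted;
    // later assignments only replace its value.
    val result = mutable.LinkedHashMap.empty[String, Int]
    for (m <- maps; (k, v) <- m) result(k) = v
    result.toSeq
  }
}
```

With the example maps, each key ends up with a single value: the last one seen in the stream.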
As a wrinkle to this, spark-sql, for some reason, eliminates duplicates in maps
on display, even though map_keys shows that the duplicate keys are still there:
{noformat}
spark-sql> select map1, map2 from mapsWithDupKeys;
{"a":2,"b":3,"c":10} {"a":7,"b":9}
Time taken: 0.147 seconds, Fetched 1 row(s)
spark-sql> select map_keys(map1) from mapsWithDupKeys;
["a","a","b","c"]
Time taken: 0.093 seconds, Fetched 1 row(s)
{noformat}
> High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) →
> map<K,V>
> -----------------------------------------------------------------------------------
>
> Key: SPARK-23936
> URL: https://issues.apache.org/jira/browse/SPARK-23936
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns the union of all the given maps. If a key is found in multiple given
> maps, that key’s value in the resulting map comes from the last one of those
> maps.