[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464245#comment-16464245 ]

Bruce Robbins commented on SPARK-23936:
---------------------------------------

[~ueshin]

I have a question about map_concat's behavior as it pertains to this part of 
the function description: "If a key is found in multiple given maps, that key's 
value in the resulting map comes from the last one of those maps."

Spark maps can have duplicate keys, e.g.:
{noformat}
scala> val df = sql("select map('a', 1, 'a', 2, 'b', 3, 'c', 10) as map1, map('a', 7, 'b', 8, 'b', 9) as map2")
scala> df.show(truncate=false)
+---------------------------------+------------------------+
|map1                             |map2                    |
+---------------------------------+------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10]|[a -> 7, b -> 8, b -> 9]|
+---------------------------------+------------------------+
{noformat}
I'm not sure the duplicate handling part of the description makes sense for 
maps that allow duplicate keys.

I can think of three ways to handle the duplicate-key requirement:

Scheme #1: Ignore it. map_concat would be a pure concatenation. Using the 
above example maps:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------------------------------+
|map_concat(map1, map2)                                   |
+---------------------------------------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9]|
+---------------------------------------------------------+
{noformat}
Duplicate keys are preserved from the original maps, and, in this example, 
additional duplicates are introduced.
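To make the three schemes concrete, here is a minimal Python sketch of Scheme #1, modeling a Spark map as an ordered list of (key, value) pairs (an illustrative assumption; Spark's internal MapData representation differs):

```python
# Scheme #1: pure concatenation. Duplicate keys from the inputs are
# preserved, and new duplicates may be introduced across maps.
def map_concat_scheme1(*maps):
    result = []
    for m in maps:
        result.extend(m)  # append every pair, no de-duplication
    return result

map1 = [("a", 1), ("a", 2), ("b", 3), ("c", 10)]
map2 = [("a", 7), ("b", 8), ("b", 9)]
concat = map_concat_scheme1(map1, map2)
# concat holds all 7 pairs: a->1, a->2, b->3, c->10, a->7, b->8, b->9
```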

Scheme #2: Preserve duplicates within input maps, but still pick a winner 
across maps. That is, treat the maps like so:
{noformat}
map1:
a -> [1, 2]
b -> [3]
c -> [10]

map2:
a -> [7]
b -> [8, 9]
{noformat}
Then use the rule that the key's value comes from the last map in which the key 
appears:
{noformat}
resulting map
a -> [7]    // from map2
b -> [8, 9] // from map2
c -> [10]   // from map1
{noformat}
In Spark, it would look like this:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------+
|map_concat(map1, map2)           |
+---------------------------------+
|[a -> 7, b -> 8, b -> 9, c -> 10]|
+---------------------------------+
{noformat}
Scheme #3: Don't allow any duplicates in the resulting map. That is, treat the 
input maps collectively as a stream of tuples, and keep only the last value for 
_any_ key:
{noformat}
a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9
        ^                        ^       ^       ^
        |                        |       |       |
     overwrites               overwrites |    overwrites
       a -> 1                   a -> 2   |      b -> 8
                                     overwrites
                                       b -> 3

scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+-------------------------+
|map_concat(map1, map2)   |
+-------------------------+
|[a -> 7, b -> 9, c -> 10]|
+-------------------------+
{noformat}
Note: This is what I've actually implemented in my PR. It made sense to me due 
to the requirement that we pick a winner across maps. But I wasn't aware then 
that the source maps could have duplicates.
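In the same ordered-pair-list model, Scheme #3 reduces to a single fold where later values overwrite earlier ones for any key:

```python
# Scheme #3: treat all input pairs as one stream; the last value seen
# for a key wins, so the result has no duplicate keys at all.
def map_concat_scheme3(*maps):
    seen = {}  # insertion-ordered dict: first occurrence fixes position
    for m in maps:
        for k, v in m:
            seen[k] = v  # later pairs overwrite earlier ones
    return list(seen.items())

map1 = [("a", 1), ("a", 2), ("b", 3), ("c", 10)]
map2 = [("a", 7), ("b", 8), ("b", 9)]
deduped = map_concat_scheme3(map1, map2)
# deduped is [("a", 7), ("b", 9), ("c", 10)], matching the example above
```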

As an added wrinkle, the spark-sql shell, for some reason, eliminates duplicate 
keys when displaying maps, even though map_keys shows they are still present:
{noformat}
spark-sql> select map1, map2 from mapsWithDupKeys;
{"a":2,"b":3,"c":10}    {"a":7,"b":9}
Time taken: 0.147 seconds, Fetched 1 row(s)
spark-sql> select map_keys(map1) from mapsWithDupKeys;
["a","a","b","c"]
Time taken: 0.093 seconds, Fetched 1 row(s)
{noformat}

> High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → 
> map<K,V>
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-23936
>                 URL: https://issues.apache.org/jira/browse/SPARK-23936
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiao Li
>            Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Returns the union of all the given maps. If a key is found in multiple given 
> maps, that key’s value in the resulting map comes from the last one of those 
> maps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
