coderfender commented on code in PR #21720:
URL: https://github.com/apache/datafusion/pull/21720#discussion_r3107149796


##########
datafusion/spark/src/function/map/utils.rs:
##########
@@ -202,17 +202,20 @@ fn map_deduplicate_keys(
                         cur_keys_offset + cur_entry_idx,
                     )?
                     .compacted();
+                    // Enforce Spark's default `spark.sql.mapKeyDedupPolicy=EXCEPTION`.
+                    // Native LAST_WIN support is deferred to a follow-up.
                     if seen_keys.contains(&key) {
-                        // TODO: implement configuration and logic for spark.sql.mapKeyDedupPolicy=EXCEPTION (this is default spark-config)
-                        // exec_err!("invalid argument: duplicate keys in map")
-                        // https://github.com/apache/spark/blob/cf3a34e19dfcf70e2d679217ff1ba21302212472/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4961
-                    } else {
-                        // This code implements deduplication logic for spark.sql.mapKeyDedupPolicy=LAST_WIN (this is NOT default spark-config)
-                        keys_mask_one[cur_entry_idx] = true;
-                        values_mask_one[cur_entry_idx] = true;
-                        seen_keys.insert(key);
-                        new_last_offset += 1;
+                        return exec_err!(
+                            "[DUPLICATED_MAP_KEY] Duplicate map key {key} was found, \
+                             please check the input data. If you want to remove the \
+                             duplicated keys, you can set spark.sql.mapKeyDedupPolicy \

Review Comment:
   We might want to keep the error message, but gear its wording more towards DataFusion (DF) than Spark.
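   As a rough illustration of the suggestion, here is a minimal, std-only sketch of a duplicate-key check whose message keeps the `[DUPLICATED_MAP_KEY]` prefix but does not point at a Spark setting. The helper name `check_unique_keys`, the `String` error type, and the exact wording are assumptions for illustration only; the real code would keep using `exec_err!` inside `map_deduplicate_keys`.

```rust
use std::collections::HashSet;
use std::fmt::Debug;
use std::hash::Hash;

/// Hypothetical helper (not part of DataFusion): returns an error for the
/// first key that appears more than once in `keys`.
fn check_unique_keys<K: Hash + Eq + Debug>(keys: &[K]) -> Result<(), String> {
    let mut seen = HashSet::with_capacity(keys.len());
    for key in keys {
        // `insert` returns false when the key was already present.
        if !seen.insert(key) {
            // Assumed wording: same error prefix, but no reference to a
            // Spark-only config, since LAST_WIN is deferred to a follow-up.
            return Err(format!(
                "[DUPLICATED_MAP_KEY] Duplicate map key {key:?} was found, \
                 please check the input data. Deduplicating keys (Spark's \
                 LAST_WIN policy) is not yet supported in DataFusion."
            ));
        }
    }
    Ok(())
}
```

   With wording along these lines, the `query error` expectation in `map_from_entries.slt` would still match, because it only checks the start of the message.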



##########
datafusion/sqllogictest/test_files/spark/map/map_from_entries.slt:
##########
@@ -151,14 +151,12 @@ SELECT
 ----
 {outer_key1: {inner_a: 1, inner_b: 2}, outer_key2: {inner_x: 10, inner_y: 20, inner_z: 30}}
 
-# Test with duplicate keys
-query ?
+# Test with duplicate keys: raises DUPLICATED_MAP_KEY under Spark's default policy
+query error DataFusion error: Execution error: \[DUPLICATED_MAP_KEY\] Duplicate map key true was found
 SELECT map_from_entries(array(
-    struct(true, 'a'), 
-    struct(false, 'b'), 
+    struct(true, 'a'),
+    struct(false, 'b'),
     struct(true, 'c'),
-    struct(false, cast(NULL as string)), 
+    struct(false, cast(NULL as string)),
     struct(true, 'd')

Review Comment:
   We might want to revert the unrelated formatting changes here (the trailing-whitespace removals).



