Konjac created SPARK-54195:
------------------------------

             Summary: Inconsistent behavior of str_to_map when handling 
delimiter regular expressions across different input collations
                 Key: SPARK-54195
                 URL: https://issues.apache.org/jira/browse/SPARK-54195
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 4.0.1
            Reporter: Konjac


According to https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map, *str_to_map* accepts delimiters as regular expressions. Here is an example:

{code:java}
spark-sql (default)> SELECT str_to_map('a=1,b->2,c=3', ',', '=|->');
{"a":"1","b":"2","c":"3"} {code}
If I change the collation to UTF8_BINARY, it still behaves as documented:

{code:java}
spark-sql (default)> SELECT str_to_map('a=1,b->2,c=3' COLLATE UTF8_BINARY, ',', 
'=|->');
{"a":"1","b":"2","c":"3"} {code}
However, with the UTF8_LCASE collation, it fails to parse the string into a map:

{code:java}
spark-sql (default)> SELECT str_to_map('a=1,b->2,c=3' COLLATE UTF8_LCASE, ',', 
'=|->');
{"a=1":null,"b->2":null,"c=3":null} {code}
This behavior is inconsistent. I believe it is caused by the current implementation of CollationAwareUTF8String.splitSQL.
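The difference can be sketched in plain Python (an illustrative guess at the two code paths; the helper names split_regex/split_literal are hypothetical and not the actual splitSQL internals):

{code:python}
import re

s = "a=1,b->2,c=3"
pairs = s.split(",")  # pair delimiter ',' splits into ['a=1', 'b->2', 'c=3']

DELIM = "=|->"

def split_regex(pair):
    # UTF8_BINARY path: the key/value delimiter is treated as a regex.
    parts = re.split(DELIM, pair, maxsplit=1)
    return (parts[0], parts[1]) if len(parts) == 2 else (parts[0], None)

def split_literal(pair):
    # Apparent UTF8_LCASE path: the delimiter is matched as a literal
    # string, so "=|->" never occurs and every value stays null.
    head, sep, tail = pair.partition(DELIM)
    return (head, tail) if sep else (head, None)

regex_map = dict(split_regex(p) for p in pairs)
literal_map = dict(split_literal(p) for p in pairs)

print(regex_map)    # {'a': '1', 'b': '2', 'c': '3'}
print(literal_map)  # {'a=1': None, 'b->2': None, 'c=3': None}
{code}

This reproduces exactly the two outputs above, which is why I suspect the collation-aware path does not apply the delimiter as a regular expression.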

Should UTF8_LCASE be implemented like UTF8_BINARY here? Could someone confirm this before I work on possible changes?

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
