Konjac created SPARK-54195:
------------------------------
Summary: Inconsistent behavior of str_to_map when handling
delimiter regular expressions across different input collations
Key: SPARK-54195
URL: https://issues.apache.org/jira/browse/SPARK-54195
Project: Spark
Issue Type: Question
Components: SQL
Affects Versions: 4.0.1
Reporter: Konjac
According to https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map
*str_to_map* can accept delimiters as regular expressions. Here is an example:
{code:java}
spark-sql (default)> SELECT str_to_map('a=1,b->2,c=3', ',', '=|->');
{"a":"1","b":"2","c":"3"} {code}
If I set the collation to UTF8_BINARY explicitly, it still works as documented.
{code:java}
spark-sql (default)> SELECT str_to_map('a=1,b->2,c=3' COLLATE UTF8_BINARY, ',',
'=|->');
{"a":"1","b":"2","c":"3"} {code}
However, with UTF8_LCASE, it fails to parse the string into a map:
{code:java}
spark-sql (default)> SELECT str_to_map('a=1,b->2,c=3' COLLATE UTF8_LCASE, ',',
'=|->');
{"a=1":null,"b->2":null,"c=3":null} {code}
This behavior is inconsistent across collations. I believe the cause is the current
implementation of CollationAwareUTF8String.splitSQL.
I wonder whether UTF8_LCASE should also treat the delimiter as a regular expression,
as UTF8_BINARY does. Could anyone confirm this before I work on possible changes?
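To illustrate the suspected difference outside of Spark, here is a minimal Python sketch contrasting regex-based splitting (the documented UTF8_BINARY behavior) with literal substring matching (my hypothesis for what the UTF8_LCASE path effectively does, based on the observed output). The literal-matching interpretation is an assumption on my part, not a reading of Spark's actual code.

{code:java}
import re

s = "a=1,b->2,c=3"

# UTF8_BINARY behavior: the key-value delimiter '=|->' is treated as
# a regular expression, so it matches either '=' or '->'.
regex_pairs = dict(re.split(r"=|->", entry, maxsplit=1)
                   for entry in s.split(","))
# -> {'a': '1', 'b': '2', 'c': '3'}

# Hypothesized UTF8_LCASE behavior: the delimiter is matched as the
# literal string '=|->', which never occurs, so each entry becomes a
# key with a null value -- matching the observed {"a=1":null,...}.
literal_pairs = {entry: None for entry in s.split(",")
                 if "=|->" not in entry}
# -> {'a=1': None, 'b->2': None, 'c=3': None}
{code}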
--
This message was sent by Atlassian Jira
(v8.20.10#820010)