[
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yangyang Gao updated SPARK-48973:
---------------------------------
Description:
In Spark, applying the mask function to a string that contains an invalid
UTF-8 character or a wide character (one outside the BMP) causes unexpected
behavior.
Example: using `*` to mask a string that contains the wide character {{🙂}}
{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}
returns `**` instead of `*`. It looks like Spark's mask treats {{🙂}} as 2
characters.
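For illustration only, here is a minimal Python sketch (not Spark code) of why a single {{🙂}} can come out as two mask characters. It assumes the masker walks UTF-16 code units, as a JVM `char`-based loop would; the helper names `utf16_code_units` and `mask_other` are hypothetical, and `mask_other` only replaces non-alphanumeric characters, which is a simplification of the real function.

```python
def utf16_code_units(text: str) -> int:
    # Each UTF-16 code unit is 2 bytes; characters outside the BMP need two units.
    return len(text.encode("utf-16-le")) // 2


def mask_other(text: str, other: str, per_code_unit: bool) -> str:
    """Replace every non-alphanumeric character with `other`.

    per_code_unit=True imitates a masker that walks UTF-16 code units
    (an assumption about the cause, not confirmed Spark behavior);
    per_code_unit=False walks Unicode code points instead.
    """
    out = []
    for ch in text:
        if ch.isalnum():
            out.append(ch)
        else:
            repeats = utf16_code_units(ch) if per_code_unit else 1
            out.append(other * repeats)
    return "".join(out)


print(utf16_code_units("🙂"))          # 2
print(mask_other("🙂", "*", True))     # **
print(mask_other("🙂", "*", False))    # *
```

Under that assumption, iterating by code point rather than by code unit would give the expected single `*`.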
Example: using the wide character {{🙂}} as the replacement character
produces garbled output
{code:sql}
select mask("ABC", "🙂");
{code}
The result is `???`.
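One possible explanation, sketched in Python (not Spark code): if the replacement character is truncated to a single 16-bit UTF-16 code unit, only the high surrogate of {{🙂}} survives, and a lone surrogate cannot be encoded as UTF-8, so an encoder that substitutes on error emits `?`. Whether Spark actually does this is an assumption; the helper names below are hypothetical.

```python
def high_surrogate(ch: str) -> str:
    # First UTF-16 code unit of a character outside the BMP.
    cp = ord(ch)
    if cp <= 0xFFFF:
        raise ValueError("character is inside the BMP; it has no surrogates")
    return chr(0xD800 + ((cp - 0x10000) >> 10))


def mask_upper_with_truncated_char(text: str, replacement: str) -> str:
    # Assumption: the replacement is cut down to one 16-bit unit before use.
    unit = high_surrogate(replacement)
    masked = "".join(unit if c.isupper() else c for c in text)
    # Lone surrogates are not valid in UTF-8; errors="replace" emits "?".
    return masked.encode("utf-8", errors="replace").decode("utf-8")


print(hex(ord(high_surrogate("🙂"))))                # 0xd83d
print(mask_upper_with_truncated_char("ABC", "🙂"))   # ???
```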
Example: masking a string that contains an invalid UTF-8 character
{code:sql}
select mask("\xED");
{code}
The result is `xXX` instead of `\xED`. It looks like Spark treats the input as
the four characters `\`, `x`, `E`, `D`.
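To make the distinction concrete, here is a small Python sketch (not Spark code) contrasting the single raw byte 0xED, which is an incomplete and therefore invalid UTF-8 sequence, with the four ASCII characters `\`, `x`, `E`, `D`, which form a perfectly valid string. The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    # Strict decoding raises on any malformed or incomplete sequence.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw_byte = b"\xed"         # one byte: lead byte of a 3-byte sequence, nothing after it
literal_text = r"\xED"     # four ASCII characters: backslash, x, E, D

print(is_valid_utf8(raw_byte))                       # False
print(len(literal_text))                             # 4
print(is_valid_utf8(literal_text.encode("utf-8")))   # True
```

If the SQL parser hands mask the four-character text rather than the raw byte, the function never sees an invalid sequence at all, so the two cases are worth testing separately.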
It looks like Spark's mask can only handle BMP characters (those that fit in a
single 16-bit UTF-16 code unit) and cannot guarantee the result for invalid
UTF-8 characters or wide characters.
My question is: *is this a limitation / issue of the Spark mask function, or
is mask by design intended to handle only BMP characters?*
If it is a limitation of the mask function, could Spark state this in the mask
function's documentation or comments?
> Unexpected behavior using spark mask function handle string contains invalid
> UTF-8 or wide character
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
> Issue Type: Question
> Components: SQL
> Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
> Reporter: Yangyang Gao
> Priority: Major