[ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
---------------------------------
    Description: 
In Spark, applying the mask function to a string that contains an invalid 
character or a wide character causes unexpected behavior.

 

Example: using {{*}} to mask a string that contains the wide character {{🙂}}

```

select mask("🙂", "Y", "y", "n", "*");

```

This returns {{**}} instead of {{*}}. 
It looks like Spark mask treats {{🙂}} as 2 characters.
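
A rough illustration of why the result might be two characters long (Python, not Spark code). It assumes, without confirmation from the Spark source, that mask walks the string one UTF-16 code unit at a time, the way a loop over JVM {{char}} values would:

```python
# Illustration only: count the smiley in code points and in UTF-16 code units
# (a UTF-16 code unit is what a Java/Scala `char` holds).
s = "\U0001F642"  # 🙂, U+1F642, outside the BMP

code_points = len(s)                           # Python str length is in code points
utf16_units = len(s.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit

print(code_points)
print(utf16_units)

# Assumed behavior: one replacement per UTF-16 code unit.
masked_per_unit = "*" * utf16_units
# Expected behavior: one replacement per code point.
masked_per_code_point = "*" * code_points

print(masked_per_unit)
print(masked_per_code_point)
```

Under that assumption the per-unit count matches the observed two-character result, while a per-code-point count gives the single {{*}} I expected.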

 

Example: using the wide character {{🙂}} as the mask replacement produces 
garbled output

```

select mask("ABC", "🙂");

```

The result is {{???}}.
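
One possible explanation for the {{???}} output, sketched in Python. The assumption (not verified against the Spark source) is that only the first UTF-16 code unit of the replacement is kept, which leaves a lone surrogate that cannot be encoded as UTF-8:

```python
# Illustration only, not Spark code.
replacement = "\U0001F642"  # 🙂 is a surrogate pair in UTF-16

# Keep just the first UTF-16 code unit (assumed truncation to a single char).
first_unit_bytes = replacement.encode("utf-16-le")[:2]
lone_surrogate = first_unit_bytes.decode("utf-16-le", errors="surrogatepass")

# Mask "ABC": every uppercase letter becomes the truncated replacement.
masked = lone_surrogate * 3

# A lone surrogate has no UTF-8 encoding; a lenient encoder substitutes "?".
encoded = masked.encode("utf-8", errors="replace")
print(encoded)
```

If something like this happens inside mask, the three question marks are the three unencodable half-characters, one per masked letter.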

 

Example: masking a string that contains an invalid UTF-8 character

 

```

 select mask("\xED");

```

 

The result is {{xXX}} instead of {{\xED}}. It looks like Spark treats the 
input as four characters: {{\}}, {{x}}, {{E}}, {{D}}.
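
For reference, a small Python check of the two readings of this input: the single byte 0xED, which is not valid UTF-8 on its own, versus the literal four-character text. Which of the two the SQL parser actually hands to mask is not something this sketch determines:

```python
# Illustration only, not Spark code.

# Reading 1: a single raw byte 0xED. It starts a multi-byte UTF-8 sequence
# but has no continuation bytes, so strict decoding fails.
raw_byte = b"\xED"
try:
    raw_byte.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Reading 2: the literal text backslash, x, E, D (four ASCII characters).
literal_text = r"\xED"
literal_chars = list(literal_text)

print(is_valid_utf8)
print(literal_chars)
```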

 

It looks like Spark mask can only handle BMP characters (those that fit in 16 bits) and 
can't guarantee the result for invalid UTF-8 characters or wide characters.
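
For comparison, a minimal sketch of a mask that walks the string by code point, which is the behavior I expected. {{mask_by_code_point}} is a hypothetical helper written for this report, not a Spark API; the defaults ({{X}}, {{x}}, {{n}}, other characters unchanged) are my reading of the documented mask defaults:

```python
def mask_by_code_point(s, upper="X", lower="x", digit="n", other=None):
    """Hypothetical reference masker; Python iterates a str by code point."""
    out = []
    for ch in s:
        if ch.isupper():
            out.append(upper)
        elif ch.islower():
            out.append(lower)
        elif ch.isdigit():
            out.append(digit)
        else:
            # other=None means "leave the character unchanged"
            out.append(ch if other is None else other)
    return "".join(out)


smiley = "\U0001F642"  # 🙂, outside the BMP

one_star = mask_by_code_point(smiley, "Y", "y", "n", "*")
three_smileys = mask_by_code_point("ABC", smiley)

print(one_star)
print(three_smileys)
```

With this approach the wide character counts as one character on input and survives intact as a replacement.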

 

 

My question is: *is this a limitation / issue of the Spark mask function, 
or does Spark mask by design only handle BMP characters?*

If it is a limitation of the mask function, could Spark document it in the mask 
function documentation or code comments?

 



> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48973
>                 URL: https://issues.apache.org/jira/browse/SPARK-48973
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
>         Environment: Ubuntu 22.04
>            Reporter: Yangyang Gao
>            Priority: Major
>
> In Spark, applying the mask function to a string that contains an invalid 
> character or a wide character causes unexpected behavior.
>  
> Example: using {{*}} to mask a string that contains the wide character {{🙂}}
> ```
> select mask("🙂", "Y", "y", "n", "*");
> ```
> This returns {{**}} instead of {{*}}. 
> It looks like Spark mask treats {{🙂}} as 2 characters.
>  
> Example: using the wide character {{🙂}} as the mask replacement produces 
> garbled output
> ```
> select mask("ABC", "🙂");
> ```
> The result is {{???}}.
>  
> Example: masking a string that contains an invalid UTF-8 character
>  
> ```
>  select mask("\xED");
> ```
>  
> The result is {{xXX}} instead of {{\xED}}. It looks like Spark treats the 
> input as four characters: {{\}}, {{x}}, {{E}}, {{D}}.
>  
> It looks like Spark mask can only handle BMP characters (those that fit in 16 bits) 
> and can't guarantee the result for invalid UTF-8 characters or wide 
> characters.
>  
>  
> My question is: *is this a limitation / issue of the Spark mask function, 
> or does Spark mask by design only handle BMP characters?*
> If it is a limitation of the mask function, could Spark document it in the mask 
> function documentation or code comments?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
