[ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
---------------------------------
    Description: 
In Spark, applying the mask function to a string that contains an invalid 
UTF-8 character or a wide character causes unexpected behavior.


Example: using `*` to mask a string that contains the wide character {{🙂}}:

```sql
select mask("🙂", "Y", "y", "n", "*");
```

This returns `**` instead of `*`. 
It looks like Spark's mask treats {{🙂}} as 2 characters.
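
The doubled `*` is consistent with a masker that walks UTF-16 code units rather than Unicode code points. The Python sketch below illustrates only that difference. It is not Spark's implementation, and the helper names (`utf16_units`, `mask_per_unit`, `mask_per_code_point`) are hypothetical.

```python
def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s, as a JVM string would store them."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def _replace(ch: str, upper: str, lower: str, digit: str, other: str) -> str:
    # Classify one character and pick its replacement.
    if ch.isupper():
        return upper
    if ch.islower():
        return lower
    if ch.isdigit():
        return digit
    return other


def mask_per_unit(s: str, upper: str, lower: str, digit: str, other: str) -> str:
    # Assumed model of the reported behavior: one replacement per 16-bit unit.
    out = []
    for unit in utf16_units(s):
        if 0xD800 <= unit <= 0xDFFF:
            # Each half of a surrogate pair is treated as a separate "other" character.
            out.append(other)
        else:
            out.append(_replace(chr(unit), upper, lower, digit, other))
    return "".join(out)


def mask_per_code_point(s: str, upper: str, lower: str, digit: str, other: str) -> str:
    # Python iterates strings by code point, so the emoji counts once.
    return "".join(_replace(ch, upper, lower, digit, other) for ch in s)


emoji = "\U0001F642"  # the wide character from the example above
print(len(utf16_units(emoji)))                          # 2
print(mask_per_unit(emoji, "Y", "y", "n", "*"))         # **
print(mask_per_code_point(emoji, "Y", "y", "n", "*"))   # *
```

Under this model, any character outside the BMP would produce two replacement characters, which matches the result reported here.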

Example: using the wide character {{🙂}} as the replacement character 
produces garbled output:

```sql
select mask("ABC", "🙂");
```
The result is `???`.
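
One possible explanation for `???`, offered as an assumption rather than a confirmed diagnosis: if only the first 16-bit unit of the replacement string is used, the replacement becomes a lone surrogate, which cannot be encoded as UTF-8 and is commonly substituted with `?` on output. The Python sketch below reproduces that effect. `first_utf16_unit` is a hypothetical helper, and nothing here is Spark code.

```python
def first_utf16_unit(s: str) -> str:
    """Return the first UTF-16 code unit of s as a one-character string."""
    data = s.encode("utf-16-le")
    return chr(int.from_bytes(data[:2], "little"))


# Keep only the first half of the emoji's surrogate pair (a high surrogate).
replacement = first_utf16_unit("\U0001F642")

# Replace every upper-case letter of "ABC" with that half character.
masked = replacement * len("ABC")

# A lone surrogate is not valid in UTF-8; errors="replace" emits "?" for each one.
encoded = masked.encode("utf-8", errors="replace")
print(encoded)  # b'???'
```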

Example: masking a string that contains an invalid UTF-8 character:

```sql
select mask("\xED");
```

The result is `xXX` instead of `\xED`; it looks like Spark treats the input 
as four characters: `\`, `x`, `E`, `D`.
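
For context on this case, the four-character text `\xED` and the single byte 0xED are different inputs. The Python sketch below only illustrates that distinction and why a lone 0xED byte is not valid UTF-8. It makes no claim about how Spark's SQL parser handles the escape, and `is_valid_utf8` is a hypothetical helper.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


literal_text = r"\xED"  # backslash, x, E, D: the characters as typed in the query
raw_byte = b"\xED"      # the single byte 0xED

print(len(literal_text))  # 4
print(len(raw_byte))      # 1

# 0xED starts a three-byte UTF-8 sequence, so on its own it is incomplete.
print(is_valid_utf8(raw_byte))                  # False
print(is_valid_utf8("\u00ED".encode("utf-8")))  # True
```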

It looks like Spark's mask can only handle BMP characters (those that fit in 
16 bits) and cannot guarantee the result for invalid UTF-8 characters or wide 
characters.


My question is: *is this a limitation or bug of Spark's mask function, or does 
mask by design handle only BMP characters?*

If it is a limitation of the mask function, could Spark note it in the mask 
function documentation or code comments?

 



> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48973
>                 URL: https://issues.apache.org/jira/browse/SPARK-48973
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
>         Environment: Ubuntu 22.04
>            Reporter: Yangyang Gao
>            Priority: Major


