[ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730869#comment-17730869
 ] 

王俊博 commented on SPARK-32559:
-----------------------------

The expressions `getByte(s) <= ' '` and `Character.isWhitespace(getByte(s))` 
are not equivalent. `Character.isWhitespace` returns false for most control 
characters below ASCII 32 (for example `\u0003`), so `trimAll()` no longer 
removes them. This deviates from the original intention of the `trimAll` 
function, which is to be consistent with Java's `String.trim`.

This causes the `date()` function to fail on such input. For example, executing 
`select date("today\u0003");` results in an error instead of returning null.
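To make the difference concrete, here is a minimal standalone sketch (not Spark 
code) that compares the two checks on a few sample bytes. The class name 
`TrimPredicates` and the third predicate `asciiLeSpace` are hypothetical and 
only for illustration. The third predicate is one possible way to mirror 
`String.trim` on signed bytes without matching the negative bytes of multi-byte 
UTF-8 characters.

```java
public class TrimPredicates {
  // Signed comparison, as in `getByte(s) <= ' '`. Java bytes are signed, so this
  // is also true for negative values, i.e. bytes >= 0x80 that belong to
  // multi-byte UTF-8 characters such as Chinese characters.
  static boolean signedLeSpace(byte b) {
    return b <= ' ';
  }

  // The check used in the trimAll() quoted in this comment.
  static boolean isWhitespace(byte b) {
    return Character.isWhitespace(b);
  }

  // Hypothetical alternative: only ASCII bytes 0x00..0x20 count as trimmable.
  static boolean asciiLeSpace(byte b) {
    return b >= 0 && b <= ' ';
  }

  public static void main(String[] args) {
    // space, tab, control character U+0003, first UTF-8 byte of a Chinese character
    byte[] samples = {' ', '\t', 0x03, (byte) 0xE4};
    for (byte b : samples) {
      System.out.printf("0x%02X signed<=' ':%b isWhitespace:%b ascii<=' ':%b%n",
          b & 0xFF, signedLeSpace(b), isWhitespace(b), asciiLeSpace(b));
    }
  }
}
```

For the byte 0x03 only `isWhitespace` returns false, which is the case that 
breaks `date("today\u0003")`. For the byte 0xE4 only the signed comparison 
returns true, which is the case the original issue was about.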

```java
/**
 * Trims whitespaces ({@literal <=} ASCII 32) from both ends of this string.
 *
 * Note that, this method is the same as java's {@link String#trim}, and different from
 * {@link UTF8String#trim()} which remove only spaces(= ASCII 32) from both ends.
 *
 * @return A UTF8String whose value is this UTF8String, with any leading and trailing white
 * space removed, or this UTF8String if it has no leading or trailing whitespace.
 *
 */
public UTF8String trimAll() {
  int s = 0;
  // skip all of the whitespaces (<=0x20) in the left side
  while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
  if (s == this.numBytes) {
    // Everything trimmed
    return EMPTY_UTF8;
  }
  // skip all of the whitespaces (<=0x20) in the right side
  int e = this.numBytes - 1;
  while (e > s && Character.isWhitespace(getByte(e))) e--;
  if (s == 0 && e == numBytes - 1) {
    // Nothing trimmed
    return this;
  }
  return copyUTF8String(s, e);
}
```
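As a rough illustration of the behavior the Javadoc describes, the sketch below 
trims a plain UTF-8 `byte[]` instead of a `UTF8String`. It is not a proposed 
patch: the class `TrimAllSketch`, the helper `isTrimmable`, and the wrapper 
`trimAllUtf8` are hypothetical names, and the predicate `b >= 0 && b <= ' '` is 
just one assumed way to remove everything up to ASCII 32 while leaving 
multi-byte characters alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TrimAllSketch {
  // Assumed predicate: ASCII bytes 0x00..0x20 only. Negative (signed) bytes
  // belong to multi-byte UTF-8 sequences and are never trimmed.
  static boolean isTrimmable(byte b) {
    return b >= 0 && b <= ' ';
  }

  // Same loop structure as the quoted trimAll(), applied to a byte array.
  static byte[] trimAll(byte[] bytes) {
    int s = 0;
    while (s < bytes.length && isTrimmable(bytes[s])) s++;
    if (s == bytes.length) {
      return new byte[0]; // everything trimmed
    }
    int e = bytes.length - 1;
    while (e > s && isTrimmable(bytes[e])) e--;
    return Arrays.copyOfRange(bytes, s, e + 1); // end index is exclusive
  }

  // Convenience wrapper for trying the sketch with Java strings.
  static String trimAllUtf8(String str) {
    byte[] trimmed = trimAll(str.getBytes(StandardCharsets.UTF_8));
    return new String(trimmed, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Trailing control character U+0003 is removed, as String.trim would do.
    System.out.println(trimAllUtf8("today\u0003").equals("today"));
    // The two Chinese characters (written as escapes) are kept.
    System.out.println(trimAllUtf8(" 1\u4E2D\u6587 ").equals("1\u4E2D\u6587"));
  }
}
```

With this predicate, `"today\u0003"` trims to `"today"` while the Chinese 
characters in `"1中文"` are preserved.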

> Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters 
> correctly
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-32559
>                 URL: https://issues.apache.org/jira/browse/SPARK-32559
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: EdisonWang
>            Assignee: EdisonWang
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.0.1
>
>
> The trim logic in the Cast expression introduced in 
> [https://github.com/apache/spark/pull/26622] trims Chinese characters 
> unexpectedly.
> For example, the SQL `select cast("1中文" as float)` returns 1 instead of null.
>  


