[
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730869#comment-17730869
]
王俊博 commented on SPARK-32559:
-----------------------------
The expressions `getByte(s) <= ' '` and `Character.isWhitespace(getByte(s))`
are not equivalent. `Character.isWhitespace` returns false for most ASCII
control characters (for example `\u0003`), so `trimAll()` no longer removes
them. This deviates from the original intention of `trimAll` (to be
consistent with Java's `String.trim`, which strips every character `<= ' '`).
As a result the `date()` function fails on such input. For example, executing
`select date("today\u0003");` results in an error instead of returning null.
```java
/**
 * Trims whitespaces ({@literal <=} ASCII 32) from both ends of this string.
 *
 * Note that, this method is the same as java's {@link String#trim}, and different from
 * {@link UTF8String#trim()} which remove only spaces(= ASCII 32) from both ends.
 *
 * @return A UTF8String whose value is this UTF8String, with any leading and trailing white
 * space removed, or this UTF8String if it has no leading or trailing whitespace.
 *
 */
public UTF8String trimAll() {
  int s = 0;
  // skip all of the whitespaces (<=0x20) in the left side
  while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
  if (s == this.numBytes) {
    // Everything trimmed
    return EMPTY_UTF8;
  }
  // skip all of the whitespaces (<=0x20) in the right side
  int e = this.numBytes - 1;
  while (e > s && Character.isWhitespace(getByte(e))) e--;
  if (s == 0 && e == numBytes - 1) {
    // Nothing trimmed
    return this;
  }
  return copyUTF8String(s, e);
}
```
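To make the mismatch concrete, here is a small standalone sketch that applies both predicates to a few sample bytes. The class and method names (`TrimPredicateDemo`, `lessOrEqualSpace`, `javaWhitespace`) are hypothetical and not part of Spark; the sketch only compares the two checks and does not propose a fix.

```java
// Hypothetical demo class; not part of Spark.
class TrimPredicateDemo {
    // Comparison-style check: the signed byte is compared against ' ' (0x20).
    static boolean lessOrEqualSpace(byte b) {
        return b <= ' ';
    }

    // Check used in the trimAll() quoted above; the byte widens to int,
    // so Character.isWhitespace(int) is the overload that gets called.
    static boolean javaWhitespace(byte b) {
        return Character.isWhitespace(b);
    }

    public static void main(String[] args) {
        // ETX, TAB, space, and the first UTF-8 byte of '中' (negative as a signed byte)
        byte[] samples = {0x03, 0x09, 0x20, (byte) 0xE4};
        for (byte b : samples) {
            System.out.printf("0x%02X  <=' ': %-5b  isWhitespace: %b%n",
                b, lessOrEqualSpace(b), javaWhitespace(b));
        }
    }
}
```

The two checks agree on TAB and space but disagree on `0x03`, which only the comparison accepts. They also disagree on `0xE4`: as a signed byte it is negative, so the comparison accepts it, which is presumably why the earlier logic trimmed Chinese characters.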
> Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters
> correctly
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: EdisonWang
> Assignee: EdisonWang
> Priority: Major
> Labels: correctness
> Fix For: 3.0.1
>
>
> The trim logic in the Cast expression introduced in
> [https://github.com/apache/spark/pull/26622] will trim Chinese characters
> unexpectedly.
> For example, the SQL `select cast("1中文" as float)` gives 1 instead of null.
>