sarutak commented on a change in pull request #33287:
URL: https://github.com/apache/spark/pull/33287#discussion_r667380567



##########
File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
##########
@@ -574,14 +574,14 @@ public UTF8String trim() {
   public UTF8String trimAll() {
     int s = 0;
     // skip all of the whitespaces (<=0x20) in the left side
-    while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
+    while (s < this.numBytes && getByte(s) <= 0x20) s++;

Review comment:
       If we comply with the comment for `trimAll`, the `<= 0x20` version seems 
correct. Here is the implementation of `String.trim`:
   
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringLatin1.java#L531-L542
   
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringUTF16.java#L847-L860
   
   Also, I noticed that characters <= 0x20 were originally trimmed, but #29375 
changed that behavior.
   That change seems to break compatibility.
   `sql-migration-guide.md` says the following:
   ```
   In Spark 3.0, when casting string value to integral types(tinyint, smallint, 
int and bigint), datetime types(date, timestamp and interval) and boolean type, 
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before 
converted to these type values, for example, `cast(' 1\t' as int)` results `1`, 
`cast(' 1\t' as boolean)` results `true`, `cast('2019-10-10\t as date)` results 
the date value `2019-10-10`. In Spark version 2.4 and below, when casting 
string to integrals and booleans, it does not trim the whitespaces from both 
ends; the foregoing results is `null`, while to datetimes, only the trailing 
spaces (= ASCII 32) are removed.
   ```
   
   In fact, `select cast('2019-10-10\b' as date);` returns `2019-10-10` in 
Spark 3.0.0. 
   But after 3.0.1, the query returns `NULL`.
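
   To illustrate why the two predicates disagree on `\b`, here is a minimal standalone sketch. It is not the Spark code: `TrimPredicateDemo` and `trimWith` are hypothetical names, and it works on Java `char`s of a plain `String` rather than on the bytes of a `UTF8String`. Backspace is 0x08, so it satisfies `<= 0x20`, but `Character.isWhitespace` does not consider it whitespace.

```java
import java.util.function.IntPredicate;

public class TrimPredicateDemo {
    // Hypothetical helper (not part of Spark): strips leading and trailing
    // chars for which the given predicate returns true.
    static String trimWith(String s, IntPredicate shouldTrim) {
        int start = 0;
        int end = s.length() - 1;
        // skip matching chars on the left side
        while (start <= end && shouldTrim.test(s.charAt(start))) start++;
        // skip matching chars on the right side
        while (end >= start && shouldTrim.test(s.charAt(end))) end--;
        return s.substring(start, end + 1);
    }

    public static void main(String[] args) {
        String input = "2019-10-10\b"; // trailing backspace (0x08)

        // String.trim-style rule: everything <= 0x20 is trimmed.
        String byRange = trimWith(input, c -> c <= 0x20);
        // Character.isWhitespace rule: 0x08 is not whitespace, so it stays.
        String byIsWhitespace = trimWith(input, Character::isWhitespace);

        System.out.println(byRange.length());        // 10
        System.out.println(byIsWhitespace.length()); // 11
    }
}
```

   With the `<= 0x20` rule the trailing `\b` is removed and the remaining text is a parseable date. With `Character.isWhitespace` the `\b` survives, which would be consistent with the `NULL` result above. For input such as `" 1\t"` both rules give the same result, since space and tab satisfy both predicates.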



