JoshRosen commented on a change in pull request #24709: [SPARK-27841][SQL]
Improve UTF8String to/fromString()/numBytesForFirstByte() performance
URL: https://github.com/apache/spark/pull/24709#discussion_r287611422
##########
File path:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
##########
@@ -139,9 +139,55 @@ public static UTF8String fromAddress(Object base, long
offset, int numBytes) {
* Creates an UTF8String from String.
*/
public static UTF8String fromString(String str) {
+ if (str == null) {
+ return null;
+ }
+ byte[] bytes;
+ // Optimization for ASCII strings: construct an exactly-right-sized
+ // array ourselves. CharsetEncoder ends up over-allocating for ASCII
+ // strings, leading to extra memory copies and garbage creation.
+ if (isAscii(str)) {
+ bytes = new byte[str.length()];
+ for (int i = 0; i < str.length(); i++) {
+ bytes[i] = (byte) str.charAt(i);
+ }
+ } else {
+ bytes = str.getBytes(StandardCharsets.UTF_8);
+ }
+ return fromBytes(bytes);
+ }
+
+ /**
+ * Pre-SPARK-27841 version of fromString(), retained for benchmarking
purposes.
+ */
+ public static UTF8String fromStringSlow(String str) {
return str == null ? null :
fromBytes(str.getBytes(StandardCharsets.UTF_8));
}
+ /**
+ * Returns true if the given string consists entirely of ASCII characters,
false otherwise.
+ */
+ private static boolean isAscii(String str) {
+ for (int i = 0; i < str.length(); i++) {
+ if (str.charAt(i) > 127) { // unsigned comparison
+ return false;
+ }
+ }
+ return true;
+ }
+
+ /**
+ * Returns true if the given UTF8 bytes consist entirely of ASCII
characters, false otherwise.
+ */
+ private static boolean isAscii(byte[] bytes) {
Review comment:
Good idea! I think we can go one step further and move the `& 0x80` out of
the loop body and into the final `return` statement, shaving a couple of
instructions out of the loop body.
Just pushed a change implementing this.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]