numBytesForFirstByte() performance

GitBox Mon, 27 May 2019 09:01:59 -0700

kiszk commented on a change in pull request #24709: [SPARK-27841][SQL] Improve 
UTF8String to/fromString()/numBytesForFirstByte() performance
URL: https://github.com/apache/spark/pull/24709#discussion_r287837386


 ##########
 File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##########
 @@ -139,9 +139,55 @@ public static UTF8String fromAddress(Object base, long 
offset, int numBytes) {
    * Creates an UTF8String from String.
    */
   public static UTF8String fromString(String str) {
+    if (str == null) {
+      return null;
+    }
+    byte[] bytes;
+    // Optimization for ASCII strings: construct an exactly-right-sized
+    // array ourselves. CharsetEncoder ends up over-allocating for ASCII
+    // strings, leading to extra memory copies and garbage creation.
+    if (isAscii(str)) {
+      bytes = new byte[str.length()];
+      for (int i = 0; i < str.length(); i++) {
+        bytes[i] = (byte) str.charAt(i);
+      }
+    } else {
+      bytes = str.getBytes(StandardCharsets.UTF_8);
+    }
+    return fromBytes(bytes);
+  }
+
+  /**
+   * Pre-SPARK-27841 version of fromString(), retained for benchmarking 
purposes.
+   */
+  public static UTF8String fromStringSlow(String str) {
     return str == null ? null : 
fromBytes(str.getBytes(StandardCharsets.UTF_8));
   }
 
+  /**
+   * Returns true if the given string consists entirely of ASCII characters, 
false otherwise.
+   */
+  private static boolean isAscii(String str) {
+    for (int i = 0; i < str.length(); i++) {
+      if (str.charAt(i) > 127) { // unsigned comparison
+        return false;
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Returns true if the given UTF8 bytes consist entirely of ASCII 
characters, false otherwise.
+   */
+  private static boolean isAscii(byte[] bytes) {
 
 Review comment:
   I think that the HW branch predictor can reduce misprediction since the 
direction is mostly the same. If we assume that ASCII case is dominant, the 
compiler optimizer may apply simdization.
   
   If this code is faster than the original one, how about applying this to 
`isAscii(String str)`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] kiszk commented on a change in pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/numBytesForFirstByte() performance

Reply via email to