[GitHub] [spark] xkrogen commented on a change in pull request #34267: [SPARK-36992][SQL] Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray

GitBox Wed, 13 Oct 2021 09:11:58 -0700


xkrogen commented on a change in pull request #34267:
URL: https://github.com/apache/spark/pull/34267#discussion_r728229804




##########
File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/ByteArray.java
##########
@@ -42,15 +45,48 @@ public static void writeToMemory(byte[] src, Object target, long targetOffset) {
   public static long getPrefix(byte[] bytes) {
     if (bytes == null) {
       return 0L;
+    }
+    return getPrefix(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length);
+  }
+
+  public static long getPrefix(Object base, long offset, int numBytes) {
+    // Since JVMs are either 4-byte aligned or 8-byte aligned, we check the size of the bytes.
+    // If size is 0, just return 0.
+    // If size is between 0 and 4 (inclusive), assume data is 4-byte aligned under the hood and
+    // use a getInt to fetch the prefix.
+    // If size is greater than 4, assume we have at least 8 bytes of data to fetch.
+    // After getting the data, we use a mask to mask out data that is not part of the bytes.
+    long p;
+    long mask = 0;
+    if (IS_LITTLE_ENDIAN) {
+      if (numBytes >= 8) {
+        p = Platform.getLong(base, offset);
+      } else if (numBytes > 4) {
+        p = Platform.getLong(base, offset);
+        mask = (1L << (8 - numBytes) * 8) - 1;
+      } else if (numBytes > 0) {
+        p = (long) Platform.getInt(base, offset);
+        mask = (1L << (8 - numBytes) * 8) - 1;
+      } else {
+        p = 0;
+      }
+      p = java.lang.Long.reverseBytes(p);
     } else {
-      final int minLen = Math.min(bytes.length, 8);
-      long p = 0;
-      for (int i = 0; i < minLen; ++i) {
-        p |= ((long) Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i) & 0xff)
-            << (56 - 8 * i);
+      // byteOrder == ByteOrder.BIG_ENDIAN
+      if (numBytes >= 8) {
+        p = Platform.getLong(base, offset);
+      } else if (numBytes > 4) {
+        p = Platform.getLong(base, offset);
+        mask = (1L << (8 - numBytes) * 8) - 1;
+      } else if (numBytes > 0) {
+        p = ((long) Platform.getInt(base, offset)) << 32;
+        mask = (1L << (8 - numBytes) * 8) - 1;
+      } else {
+        p = 0;
       }
-      return p;
     }
+    p &= ~mask;
+    return p;

Review comment:
       I see this code is copied from `UTF8String`, but it seems like we 
could make it much more concise, and make the differences between big- and 
little-endian handling stand out more, by combining the branches differently:
   ```java
       final long p;
       final long mask;
       if (numBytes >= 8) {
         p = Platform.getLong(base, offset);
         mask = 0;
       } else if (numBytes > 4) {
         p = Platform.getLong(base, offset);
         mask = (1L << (8 - numBytes) * 8) - 1;
       } else if (numBytes > 0) {
         long pRaw = Platform.getInt(base, offset);
         p = IS_LITTLE_ENDIAN ? pRaw : (pRaw << 32);
         mask = (1L << (8 - numBytes) * 8) - 1;
       } else {
         p = 0;
         mask = 0;
       }
       final long pBigEndian = IS_LITTLE_ENDIAN ? java.lang.Long.reverseBytes(p) : p;
       return pBigEndian & ~mask;
   ```
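
To sanity-check that the combined branches behave the same as the removed byte-by-byte loop, a standalone sketch along these lines may help. It is not Spark code: `PrefixSketch`, `prefixLoop`, and `prefixCombined` are hypothetical names, and `ByteBuffer` reads over a padded copy stand in for `Platform.getLong`/`Platform.getInt`. The `0xAB` padding is an assumption meant to mimic whatever bytes happen to follow the array in memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class PrefixSketch {
  // Reference behavior: the byte-by-byte loop that the diff removes.
  static long prefixLoop(byte[] bytes) {
    final int minLen = Math.min(bytes.length, 8);
    long p = 0;
    for (int i = 0; i < minLen; ++i) {
      p |= ((long) bytes[i] & 0xff) << (56 - 8 * i);
    }
    return p;
  }

  // Combined-branch version from the suggestion, with ByteBuffer standing in
  // for Platform (assumption: reads past numBytes return arbitrary bytes).
  static long prefixCombined(byte[] bytes, ByteOrder order) {
    final int numBytes = bytes.length;
    final boolean isLittleEndian = order == ByteOrder.LITTLE_ENDIAN;
    // Pad to at least 8 bytes with non-zero filler so the mask is exercised.
    final byte[] padded = new byte[Math.max(8, numBytes)];
    Arrays.fill(padded, (byte) 0xAB);
    System.arraycopy(bytes, 0, padded, 0, numBytes);
    final ByteBuffer buf = ByteBuffer.wrap(padded).order(order);

    final long p;
    final long mask;
    if (numBytes >= 8) {
      p = buf.getLong(0);
      mask = 0;
    } else if (numBytes > 4) {
      p = buf.getLong(0);
      mask = (1L << (8 - numBytes) * 8) - 1;
    } else if (numBytes > 0) {
      final long pRaw = buf.getInt(0); // sign-extended; the mask clears the extra bits
      p = isLittleEndian ? pRaw : (pRaw << 32);
      mask = (1L << (8 - numBytes) * 8) - 1;
    } else {
      p = 0;
      mask = 0;
    }
    final long pBigEndian = isLittleEndian ? Long.reverseBytes(p) : p;
    return pBigEndian & ~mask;
  }

  public static void main(String[] args) {
    for (int len = 0; len <= 10; len++) {
      final byte[] data = new byte[len];
      for (int i = 0; i < len; i++) {
        data[i] = (byte) (0xF1 + i); // high bit set to exercise sign extension
      }
      for (ByteOrder order : new ByteOrder[] {ByteOrder.LITTLE_ENDIAN, ByteOrder.BIG_ENDIAN}) {
        if (prefixLoop(data) != prefixCombined(data, order)) {
          throw new AssertionError("mismatch at len=" + len + " order=" + order);
        }
      }
    }
    System.out.println("all prefixes match");
  }
}
```

The sketch only exercises the branch logic and the mask arithmetic. It says nothing about alignment or about whether an over-read through `Platform` is safe for a given `base`/`offset`.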




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


