[GitHub] [spark] JoshRosen commented on a change in pull request #34310: [SPARK-37037][SQL] Improve byte array sort by unify compareTo function of UTF8String and ByteArray

GitBox Mon, 18 Oct 2021 11:37:12 -0700


JoshRosen commented on a change in pull request #34310:
URL: https://github.com/apache/spark/pull/34310#discussion_r731210073




##########
File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/ByteArray.java
##########
@@ -75,6 +75,42 @@ static long getPrefix(Object base, long offset, int 
numBytes) {
     return (IS_LITTLE_ENDIAN ? java.lang.Long.reverseBytes(p) : p) & ~mask;
   }
 
+  public static int compareBinary(byte[] leftBase, byte[] rightBase) {
+    return compareBinary(leftBase, Platform.BYTE_ARRAY_OFFSET, leftBase.length,

Review comment:
       +1. 
   
   It seems plausible that the new version will be faster, but it's probably a 
good idea to run a quick benchmark to confirm. There's a `UTF8StringBenchmark` 
linked from https://github.com/apache/spark/pull/19180#issuecomment-329467838 : 
maybe we could adapt that to work on byte arrays and do a quick 
before-and-after comparison to just to double check?
   
   **Edit**: just to clarify: I noticed that this benchmark is also linked in 
the PR description. As Sean points out, I think the key difference in this PR 
is whether we're using `getByte()` versus directly accessing the on-heap byte 
array (in the linked UTF8String benchmark, both the old and new code were using 
`getByte()`.

##########
File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/ByteArray.java
##########
@@ -75,6 +75,42 @@ static long getPrefix(Object base, long offset, int 
numBytes) {
     return (IS_LITTLE_ENDIAN ? java.lang.Long.reverseBytes(p) : p) & ~mask;
   }
 
+  public static int compareBinary(byte[] leftBase, byte[] rightBase) {
+    return compareBinary(leftBase, Platform.BYTE_ARRAY_OFFSET, leftBase.length,

Review comment:
       +1. 
   
   It seems plausible that the new version will be faster, but it's probably a 
good idea to run a quick benchmark to confirm. There's a `UTF8StringBenchmark` 
linked from https://github.com/apache/spark/pull/19180#issuecomment-329467838 : 
maybe we could adapt that to work on byte arrays and do a quick 
before-and-after comparison to just to double check?
   
   **Edit**: just to clarify: I noticed that this benchmark is also linked in 
the PR description. As Sean points out, I think the key difference in this PR 
is whether we're using `getByte()` versus directly accessing the on-heap byte 
array (in the linked UTF8String benchmark, both the old and new code were using 
`getByte()`0.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] JoshRosen commented on a change in pull request #34310: [SPARK-37037][SQL] Improve byte array sort by unify compareTo function of UTF8String and ByteArray

Reply via email to