GitHub user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/19180#discussion_r138989200
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -1097,10 +1100,23 @@ public UTF8String copy() {
@Override
public int compareTo(@Nonnull final UTF8String other) {
int len = Math.min(numBytes, other.numBytes);
- // TODO: compare 8 bytes as unsigned long
- for (int i = 0; i < len; i ++) {
+ int wordMax = len & ~7;
+ long roffset = other.offset;
+ Object rbase = other.base;
+ for (int i = 0; i < wordMax; i += 8) {
+ long left = getLong(base, offset + i);
+ long right = getLong(rbase, roffset + i);
+ if (left != right) {
+ if (IS_LITTLE_ENDIAN) {
+ return UnsignedLongs.compare(Long.reverseBytes(left), Long.reverseBytes(right));
+ } else {
+ return UnsignedLongs.compare(left, right);
+ }
+ }
+ }
+ for (int i = wordMax; i < len; i++) {
// In UTF-8, the byte should be unsigned, so we should compare them as unsigned int.
- int res = (getByte(i) & 0xFF) - (other.getByte(i) & 0xFF);
+ int res = (getByte(i) & 0xFF) - (Platform.getByte(rbase, roffset + i) & 0xFF);
--- End diff --
Does this really make a difference? It's the same call, and the other object
has already calculated its base and offset. Maybe it does; I'm just wondering
whether this is a benchmarking artifact or a real improvement.
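
For context on the word-wise loop in the diff above, here is a minimal standalone sketch of the same idea: compare 8 bytes at a time as unsigned longs, then finish the tail byte by byte. It is not the PR's code. It uses the JDK's `ByteBuffer` and `Long.compareUnsigned` in place of `Platform.getLong` and Guava's `UnsignedLongs.compare`, the class and method names are hypothetical, and the final length tie-break is an assumption because the diff excerpt ends before that point. Reading each word as big-endian puts the first byte in the most significant position, which is the ordering the diff recovers with `Long.reverseBytes` on little-endian platforms.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of Spark.
final class WordCompareSketch {

  // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
  static int compare(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    int wordMax = len & ~7; // largest multiple of 8 that is <= len

    // Big-endian reads make unsigned long order match byte-wise order.
    ByteBuffer left = ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN);
    ByteBuffer right = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN);

    for (int i = 0; i < wordMax; i += 8) {
      long l = left.getLong(i);
      long r = right.getLong(i);
      if (l != r) {
        return Long.compareUnsigned(l, r);
      }
    }

    // Tail: fewer than 8 bytes remain, compare them one at a time.
    for (int i = wordMax; i < len; i++) {
      int res = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (res != 0) {
        return res;
      }
    }

    // Assumption: when one input is a prefix of the other, the shorter sorts first.
    return a.length - b.length;
  }

  public static void main(String[] args) {
    byte[] high = {(byte) 0xC3, (byte) 0xA9, 'a', 'a', 'a', 'a', 'a', 'a'};
    byte[] low = {'z', 'a', 'a', 'a', 'a', 'a', 'a', 'a'};
    // A signed long comparison would get this pair wrong; unsigned does not.
    System.out.println(Integer.signum(compare(high, low)));
  }
}
```

The first pair in `main` starts with a byte above 0x7F, which is the case where a signed comparison of the 8-byte words would disagree with unsigned byte order.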