Github user original-brownbear commented on the issue:
https://github.com/apache/spark/pull/19180
@srowen played a little with
https://github.com/original-brownbear/spark/blob/string-compareto-benchmark-both/core/src/test/scala/org/apache/spark/benchmarks/UTF8StringBenchmark.scala
and got:
###master
```sh
String compareTo: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2-7 byte 798 / 817 82.2
12.2 1.0X
8-16 byte 827 / 851 79.3
12.6 1.0X
16-32 byte 891 / 909 73.6
13.6 0.9X
512-1024 byte 1311 / 1331 50.0
20.0 0.6X
```
this branch
```sh
String compareTo: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2-7 byte 942 / 972 69.6
14.4 1.0X
8-16 byte 460 / 475 142.3
7.0 2.0X
16-32 byte 488 / 530 134.4
7.4 1.9X
512-1024 byte 788 / 809 83.2
12.0 1.2X
```
so as expected for very short strings we're taking a hit small hit, but are
much faster for longer strings.
We could do something like this to not take the hit for short strings (this
is the fastest I could find, admittedly not as readable as before but this is
the version with the minimum number of instructions as far as I can tell):
```java
@Override
public int compareTo(@Nonnull final UTF8String other) {
int len = Math.min(numBytes, other.numBytes);
if (len > 7) {
return compareLarge(
this.base, this.offset, numBytes, other.base, other.offset,
other.numBytes, len);
} else {
return compareSmall(
this.base, this.offset, numBytes, other.base, other.offset,
other.numBytes, len
);
}
}
private static int compareLarge(Object base1, long off1, int len1, Object
base2, long off2,
int len2, int len) {
int wordMax = len & ~7;
for (int i = 0; i < wordMax; i += 8) {
long left = getLong(base1, off1 + i);
long right = getLong(base2, off2 + i);
if (left != right) {
if (IS_LITTLE_ENDIAN) {
return UnsignedLongs.compare(Long.reverseBytes(left),
Long.reverseBytes(right));
} else {
return UnsignedLongs.compare(left, right);
}
}
}
for (int i = wordMax; i < len; i++) {
// In UTF-8, the byte should be unsigned, so we should compare them
as unsigned int.
int res = (Platform.getByte(base1, off1 + i) & 0xFF) -
(Platform.getByte(base2, off2 + i) & 0xFF);
if (res != 0) {
return res;
}
}
return compareSmall(base1, off1 + wordMax, len1, base2, off2 + wordMax,
len2, len - wordMax);
}
private static int compareSmall(Object base1, long off1, int len1, Object
base2, long off2,
int len2, int len) {
for (int i = 0; i < len; i++) {
// In UTF-8, the byte should be unsigned, so we should compare them
as unsigned int.
int res = (Platform.getByte(base1, off1 + i) & 0xFF) -
(Platform.getByte(base2, off2 + i) & 0xFF);
if (res != 0) {
return res;
}
}
return len1 - len2;
}
```
this version comes out neutral for small strings too:
```sh
String compareTo: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2-7 byte 811 / 829 80.8
12.4 1.0X
8-16 byte 453 / 466 144.7
6.9 1.8X
16-32 byte 477 / 487 137.5
7.3 1.7X
512-1024 byte 769 / 784 85.2
11.7 1.1X
2-7 byte 811 / 830 80.8
12.4 1.0X
```
added the run of the small comparison at the end again, to make sure we're
not just cheating the JIT by keeping `compareTo` artificially small so long as
we don't see any larger strings.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]