Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/19355#discussion_r141131388
--- Diff:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -501,14 +501,13 @@ public UTF8String trim() {
int e = this.numBytes - 1;
// skip all of the space (0x20) in the left side
while (s < this.numBytes && getByte(s) == 0x20) s++;
- // skip all of the space (0x20) in the right side
- while (e >= 0 && getByte(e) == 0x20) e--;
- if (s > e) {
+ if (s == this.numBytes) {
// empty string
return EMPTY_UTF8;
- } else {
- return copyUTF8String(s, e);
}
+ // skip all of the space (0x20) in the right side
+ while (e >= 0 && getByte(e) == 0x20) e--;
--- End diff --
Nit: while you're optimizing this, you can bring the declaration of e down
here, as it won't be used unless there's a non-space char.
I think this condition can start with `e > s` too. At the end, s and e
point to the first and last non-space char. When the loop starts, s points to a
non-space char. So you can stop when e == s; this is the case of one non-space
char.
Might be worth adding test cases for an empty string, and single-non-char
string too.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]