jovanpavl-db commented on PR #48521:
URL: https://github.com/apache/spark/pull/48521#issuecomment-2433996650
> couple small fixes, but one major thing we should consider is a more
efficient implementation for `binaryTrim` - in general, let's try to:
>
> * minimize extra allocations as much as we can
> * let's use better trim chars querying (e.g. hashmap)
> * try to build the string in one pass
>
> for example, see the existing `trim` implementations in
CollationAwareUTF8String public static UTF8String trimRight( public static
UTF8String trimLeft(
I refactored a bit this both binaryTrim, lowercaseTrim function to get the
most optimal complexity. Here is a brief analysis (and also sanity check for
me).
**binaryTrim**:
1. Used hash set to get on average O(1) comparison when searching trimString
codes in srcString. Note that this is more efficient than original trim
implementation in UTF8String because there is linear search done, which brings
complexity to O(len(srcString) * len(trimString)).
2. Didn't knew that contains is O(n^2), used contain and ascii space code to
avoid it and it boils down checks if spaces should be ignored to O(1).
3. At the end we have :`
return UTF8String.concat(
srcString.substring(0, searchIndex),
srcString.substring(lastNonSpacePosition, srcString.numChars()));`
this might look suspicious and true we have 1 copy-ing of
O(len(searchIndex)) and 1 copying of O(number_trailing_spaces) and again
copying those 2 (when concatenating), which is in the worst case O(2 *
len(srcString)) O(srcString). So we did at the end get constant factor of 2,
and true we can avoid it. But bottom line even in UTF8String rtrim
implementation this copy-ing is done multiple time. On the other hand to avoid
this 1-liner you need to literally reimplementing copyUTF8String(start1, end1,
start2, end2) which I find ultra messy and frankly unnecessary, I'm really
against it.
**lcaseTrim**
1. Original code is minimally changed (not adding even constant factor).
Point 3. from binaryTrim holds here also.
**
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]