jovanpavl-db commented on PR #48521:
URL: https://github.com/apache/spark/pull/48521#issuecomment-2433996650

   > couple small fixes, but one major thing we should consider is a more 
efficient implementation for `binaryTrim` - in general, let's try to:
   > 
   > * minimize extra allocations as much as we can
   > * let's use better trim chars querying (e.g. hashmap)
   > * try to build the string in one pass
   > 
   > for example, see the existing `trim` implementations in 
CollationAwareUTF8String public static UTF8String trimRight( public static 
UTF8String trimLeft(
   
   I refactored a bit this both binaryTrim, lowercaseTrim  function to get the 
most optimal complexity. Here is a brief analysis (and also sanity check for 
me).
   
   **binaryTrim**:
   1. Used hash set to get on average O(1) comparison when searching trimString 
codes in srcString. Note that this is more efficient than original trim 
implementation in UTF8String  because there is linear search done, which brings 
complexity to O(len(srcString) * len(trimString)). 
   2. Didn't knew that contains is O(n^2), used contain and ascii space code to 
avoid it and it boils down checks if spaces should be ignored to O(1).
   3. At the end we have :`
       return UTF8String.concat(
         srcString.substring(0, searchIndex),
         srcString.substring(lastNonSpacePosition, srcString.numChars()));`
         this might look suspicious and true we have 1 copy-ing of 
O(len(searchIndex)) and 1 copying of O(number_trailing_spaces) and again 
copying those 2 (when concatenating), which is in the worst case O(2 * 
len(srcString)) O(srcString). So we did at the end get constant factor of 2, 
and true we can avoid it. But bottom line even in UTF8String rtrim 
implementation this copy-ing is done multiple time. On the other hand to avoid 
this 1-liner you need to literally reimplementing copyUTF8String(start1, end1, 
start2, end2) which I find ultra messy and frankly unnecessary, I'm really 
against it. 
   
   **lcaseTrim**
   1. Original code is minimally changed (not adding even constant factor). 
Point 3. from binaryTrim holds here also.
   **


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to