[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

kevinyu98 Mon, 22 May 2017 16:40:47 -0700

Github user kevinyu98 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12646#discussion_r117869273
  
    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -510,6 +510,69 @@ public UTF8String trim() {
         }
       }
     
    +  /**
    +   * Removes the given trim string from both ends of a string
    +   * @param trimString the trim character string
    +   */
    +  public UTF8String trim(UTF8String trimString) {
    +    // This method searches for each character in the source string, removes the character if it is found
    +    // in the trim string, stops at the first not found. It starts from left end, then right end.
    +    // It returns a new string in which both ends trim characters have been removed.
    +    int s = 0; // the searching byte position of the input string
    +    int i = 0; // the first beginning byte position of a non-matching character
    +    int e = 0; // the last byte position
    +    int numChars = 0; // number of characters from the input string
    +    int[] stringCharLen = new int[numBytes]; // array of character length for the input string
    +    int[] stringCharPos = new int[numBytes]; // array of the first byte position for each character in the input string
    +    int searchCharBytes;
    +
    +    while (s < this.numBytes) {
    +      UTF8String searchChar = copyUTF8String(s, s + numBytesForFirstByte(this.getByte(s)) - 1);
    +      searchCharBytes = searchChar.numBytes;
    +      // try to find the matching for the searchChar in the trimString set
    +      if (trimString.find(searchChar, 0) >= 0) {
    +        i += searchCharBytes;
    +      } else {
    +        // no matching, exit the search
    +        break;
    +      }
    +      s += searchCharBytes;
    +    }
    +
    +    if (i >= this.numBytes) {
    +      // empty string
    +      return UTF8String.EMPTY_UTF8;
    +    } else {
    +      //build the position and length array
    +      s = 0;
    +      while (s < numBytes) {
    +        stringCharPos[numChars] = s;
    +        stringCharLen[numChars]= numBytesForFirstByte(getByte(s));
    --- End diff --
    
    Do you mean recording `stringCharPos` and `stringCharLen` before starting
trimLeft and trimRight? I can do that. My thinking was that these two arrays
are only used by trimRight, so if trimLeft trims the entire source string we
never need to run trimRight, which saves some work. But that is not the
common case.
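
    To make the trade-off concrete, here is a minimal sketch of the "build the
arrays only if trimLeft leaves something" idea. It works on a plain UTF-8
`byte[]` rather than on `UTF8String`, and the class and helper names
(`LazyTrimSketch`, `charLen`, `inSet`) are hypothetical, not code from the PR.
It assumes well-formed UTF-8 input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not the PR's UTF8String implementation.
final class LazyTrimSketch {
    // Byte length of a UTF-8 character from its first byte (assumes well-formed input).
    static int charLen(byte first) {
        int b = first & 0xFF;
        if (b < 0x80) return 1;
        if (b >= 0xF0) return 4;
        if (b >= 0xE0) return 3;
        return 2;
    }

    // True if the character stored in src[pos, pos + len) occurs in the trim set.
    static boolean inSet(byte[] src, int pos, int len, byte[] set) {
        for (int p = 0; p < set.length; p += charLen(set[p])) {
            if (charLen(set[p]) != len) continue;
            boolean same = true;
            for (int k = 0; k < len; k++) {
                if (src[pos + k] != set[p + k]) { same = false; break; }
            }
            if (same) return true;
        }
        return false;
    }

    static byte[] trim(byte[] src, byte[] set) {
        // Left pass: characters are decoded front to back, so no position table is needed.
        int start = 0;
        while (start < src.length) {
            int len = charLen(src[start]);
            if (!inSet(src, start, len, set)) break;
            start += len;
        }
        // Everything was trimmed: return before allocating the position table.
        if (start >= src.length) return new byte[0];

        // Only now record where each remaining character begins.
        int[] charPos = new int[src.length - start];
        int numChars = 0;
        for (int p = start; p < src.length; p += charLen(src[p])) {
            charPos[numChars++] = p;
        }

        // Right pass: walk the recorded boundaries from the last character backwards.
        int end = src.length;
        for (int c = numChars - 1; c >= 0; c--) {
            int pos = charPos[c];
            if (!inSet(src, pos, end - pos, set)) break;
            end = pos;
        }
        return Arrays.copyOfRange(src, start, end);
    }

    public static void main(String[] args) {
        byte[] out = trim("xyhelloyx".getBytes(StandardCharsets.UTF_8),
                          "xy".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

    The saving only shows up when the whole string is trimmed away, which is
the uncommon case mentioned above; otherwise both versions do the same work.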
    
    > wzhfy 3 days ago Contributor
    > Besides, can we reuse the code in trimLeft and trimRight? They look similar.
    
    Do you mean calling the trimLeft method first and then calling trimRight
inside the trim method? Sure, I can do that. I looked at the existing
space-trimming methods, and they are not written this way. But the code for
this new feature is more complicated, so I think reusing them will make it
more readable.
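
    As a rough illustration of that reuse, the sketch below composes `trim`
from `trimLeft` and `trimRight`. It uses `java.lang.String` and code points
instead of `UTF8String` bytes, and the class name `ComposedTrimSketch` is
hypothetical; it shows the shape of the refactoring, not the code that will
go into the PR.

```java
// Hypothetical illustration on java.lang.String; the PR itself works on UTF8String bytes.
final class ComposedTrimSketch {
    // Drops leading characters for as long as they appear in trimChars.
    static String trimLeft(String src, String trimChars) {
        int start = 0;
        while (start < src.length()) {
            int cp = src.codePointAt(start);
            if (trimChars.indexOf(cp) < 0) break; // first character not in the set
            start += Character.charCount(cp);
        }
        return src.substring(start);
    }

    // Drops trailing characters for as long as they appear in trimChars.
    static String trimRight(String src, String trimChars) {
        int end = src.length();
        while (end > 0) {
            int cp = src.codePointBefore(end);
            if (trimChars.indexOf(cp) < 0) break; // last character not in the set
            end -= Character.charCount(cp);
        }
        return src.substring(0, end);
    }

    // Trimming both ends is the left trim followed by the right trim.
    static String trim(String src, String trimChars) {
        return trimRight(trimLeft(src, trimChars), trimChars);
    }

    public static void main(String[] args) {
        System.out.println(trim("xyhelloyx", "xy"));
    }
}
```

    If trimLeft consumes the whole input, trimRight receives an empty string
and returns immediately, so the composed version keeps the early exit
discussed above without a separate branch.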



