Github user tarekauel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7115#discussion_r34372399
  
    --- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -515,4 +522,45 @@ public int hashCode() {
         }
         return result;
       }
    +
    +  /**
    +   * Encodes a string into a Soundex value. Soundex is an encoding used to relate similar names,
    +   * but can also be used as a general purpose scheme to find word with similar phonemes.
    +   */
    +  public UTF8String Soundex() {
    +    if (numBytes == 0) {
    +      return UTF8String.fromBytes(new byte[0]);
    +    }
    +    String tmp;
    +    byte data[] = {'0','0','0','0'};
    +    char ch;
    +    int idx = 0;
    +    int offset = numBytesForFirstByte(getByte(0));
    +    if(offset>1)
    +    {
    +        return UTF8String.fromBytes(getBytes());
    +    }
    +    int i = 1;
    +    int j = 1;
    +    data[0] = getByte(0);
    +
    +    while (i < numBytes) {
    +      if (j > 4) break;
    +      idx = getByte(i) - 'A';
    +      if (idx >= 0 && idx <= US_ENGLISH_MAPPING.length) {
    +        ch = US_ENGLISH_MAPPING[idx];
    +        if (ch - '0' > 0) {
    +          tmp = Character.toString(ch);
    +          System.arraycopy(tmp.getBytes(), 0, data, j, 1);
    +          if (getByte(i) == getByte(i + 1) || getByte(i) == getByte(i - 1)) {
    --- End diff --
    
    You could move this check to line `551`. Then you only need to check whether
    the next character is equal, and skip if it is. For a string like `aaa`, the
    last `a` would then add the Soundex character. (Your code adds the Soundex
    character for the first `a`, which is fine, but I guess you can simplify this
    by checking first whether the next character is equal and skipping immediately.)
    ################
    `getByte(i + 1)` could be out of bounds (`IndexOutOfBounds`) when `i` is the last byte.
    ################
    Just comparing the next two bytes is not enough. You have to check
    whether they map to the same Soundex character.
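
To illustrate the last two points, here is a minimal sketch of the loop, not a drop-in replacement for the method in the diff. It works on a plain `String` rather than `UTF8String`, and the names `SoundexSketch`, `MAPPING`, `codeOf`, and `soundex` are hypothetical. It compares mapped Soundex codes instead of raw bytes and only looks backward, so it never reads past the end of the input. Skipping `H` and `W` without resetting the previous code follows the usual American Soundex rule, which is an assumption beyond what the comment above asks for.

```java
final class SoundexSketch {
  // Soundex digit for each letter 'A'..'Z'; '0' marks letters that get no digit.
  private static final char[] MAPPING = "01230120022455012623010202".toCharArray();

  // Returns the Soundex digit for c, or '0' for vowels and non-letters.
  private static char codeOf(char c) {
    char u = Character.toUpperCase(c);
    return (u >= 'A' && u <= 'Z') ? MAPPING[u - 'A'] : '0';
  }

  static String soundex(String s) {
    if (s.isEmpty()) {
      return "";
    }
    char[] out = {'0', '0', '0', '0'};
    out[0] = Character.toUpperCase(s.charAt(0));
    char last = codeOf(s.charAt(0)); // code of the previous significant letter
    int j = 1;
    for (int i = 1; i < s.length() && j < 4; i++) {
      char c = Character.toUpperCase(s.charAt(i));
      if (c == 'H' || c == 'W') {
        continue; // assumption: H and W do not separate letters with equal codes
      }
      char code = codeOf(c);
      // Compare mapped codes, not raw characters: "ck" and "cc" both collapse.
      if (code != '0' && code != last) {
        out[j++] = code;
      }
      last = code;
    }
    return new String(out);
  }
}
```

With this shape, `aaa` yields `A000`, and `Tymczak` yields `T522` because `c` and `z` are different bytes that map to the same code.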

