GitHub user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20796#discussion_r175679506
  
    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -187,8 +218,9 @@ public void writeTo(OutputStream out) throws IOException {
        * @param b The first byte of a code point
        */
       private static int numBytesForFirstByte(final byte b) {
    -    final int offset = (b & 0xFF) - 192;
    -    return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
    +    final int offset = b & 0xFF;
    +    byte numBytes = bytesOfCodePointInUTF8[offset];
    +    return (numBytes == 0) ? 1: numBytes; // Skip the first byte disallowed in UTF-8
    --- End diff --
    
    I think so. We skip over such bytes and count each one as a single entity.
    If we don't count them, we break `substring`, `toUpperCase`, `toLowerCase`,
    `trimRight`/`trimLeft`, etc. The reason for the change is to avoid crashing
    on bad input: previously we threw `IndexOutOfBoundsException` on some invalid
    characters but let others pass (counted as 1). This PR covers the whole range
    of first-byte values. I believe ignoring or removing invalid characters should
    be addressed in the changes for https://issues.apache.org/jira/browse/SPARK-23741
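
To illustrate the behaviour described above, here is a minimal sketch, not the code from the PR. The class name `Utf8LeadByteSketch`, the table `BYTES_FOR_LEAD`, and the helper `numChars` are hypothetical. The table is built from the standard UTF-8 lead-byte ranges rather than copied from `bytesOfCodePointInUTF8`, so its exact contents are an assumption.

```java
final class Utf8LeadByteSketch {
    // Hypothetical 256-entry table: entry i is the sequence length announced
    // by lead byte i, or 0 when byte i cannot start a UTF-8 sequence.
    private static final byte[] BYTES_FOR_LEAD = new byte[256];

    static {
        for (int i = 0; i < 256; i++) {
            if (i <= 0x7F) {
                BYTES_FOR_LEAD[i] = 1;            // ASCII
            } else if (i >= 0xC2 && i <= 0xDF) {
                BYTES_FOR_LEAD[i] = 2;            // two-byte lead
            } else if (i >= 0xE0 && i <= 0xEF) {
                BYTES_FOR_LEAD[i] = 3;            // three-byte lead
            } else if (i >= 0xF0 && i <= 0xF4) {
                BYTES_FOR_LEAD[i] = 4;            // four-byte lead
            } else {
                BYTES_FOR_LEAD[i] = 0;            // disallowed as a first byte
            }
        }
    }

    // Every possible byte value indexes into the table, so no input can
    // cause an out-of-bounds access. Disallowed bytes are stepped over one
    // at a time and therefore counted as one entity each.
    static int numBytesForFirstByte(final byte b) {
        final byte numBytes = BYTES_FOR_LEAD[b & 0xFF];
        return (numBytes == 0) ? 1 : numBytes;
    }

    // Counts entities by hopping from one first byte to the next.
    static int numChars(final byte[] bytes) {
        int count = 0;
        for (int i = 0; i < bytes.length; i += numBytesForFirstByte(bytes[i])) {
            count++;
        }
        return count;
    }
}
```

Under these assumptions, the input bytes `0x41 0xFF 0xC3 0xA9 0x80` count as four entities: `A`, the invalid `0xFF`, the two-byte sequence `0xC3 0xA9`, and the stray continuation byte `0x80`. Because invalid bytes are counted instead of dropped, offsets used by operations such as `substring` stay consistent with the reported length.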

