GitHub user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20796#discussion_r175573949
  
    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -57,12 +57,39 @@
       public Object getBaseObject() { return base; }
       public long getBaseOffset() { return offset; }
     
    -  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    -    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    -    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    -    4, 4, 4, 4, 4, 4, 4, 4,
    -    5, 5, 5, 5,
    -    6, 6};
    +  /**
    +   * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which
    +   * indicates the size of the char. See Unicode standard in page 126:
    +   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
    +   *
    +   * Binary    Hex          Comments
    +   * 0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
    +   * 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
    +   * 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
    --- End diff --
    
    Actually, this table is from the Unicode Standard (10.0, Table 3-6, page 126):
    <img width="571" alt="screen shot 2018-03-19 at 9 23 14 pm" src="https://user-images.githubusercontent.com/1580697/37620318-2c173f7e-2bbc-11e8-8fa5-5f04d11925de.png">
    
    0xC0 and 0xC1 are first bytes of 2-byte chars that UTF-8 disallows (for now): they could only encode code points below U+0080, which already have a 1-byte form.
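
To make the point about 0xC0 and 0xC1 concrete, here is a minimal sketch of a first-byte classifier that follows the well-formed ranges in the Unicode Standard rather than the raw bit patterns. The class name `Utf8LeadByte` and the method `leadByteLength` are hypothetical names chosen for illustration; this is not the code in the PR, and returning 0 for an invalid lead byte is an assumption of the sketch, not a statement about what `UTF8String` does.

```java
// Hypothetical helper for illustration only; not part of Spark.
final class Utf8LeadByte {
  private Utf8LeadByte() {}

  /**
   * Returns the length (1-4) of the UTF-8 sequence announced by a lead byte,
   * or 0 if the byte cannot start a well-formed sequence.
   * Returning 0 for invalid bytes is an assumption made for this sketch.
   */
  static int leadByteLength(byte b) {
    int v = b & 0xFF;                          // treat the byte as unsigned
    if (v <= 0x7F) return 1;                   // 0xxxxxxx: single-byte char
    if (v >= 0xC2 && v <= 0xDF) return 2;      // 110xxxxx, minus 0xC0 and 0xC1
    if (v >= 0xE0 && v <= 0xEF) return 3;      // 1110xxxx
    if (v >= 0xF0 && v <= 0xF4) return 4;      // 11110xxx, up to U+10FFFF
    // 0x80..0xBF are continuation bytes; 0xC0 and 0xC1 would only produce
    // overlong 2-byte forms; 0xF5..0xFF would exceed U+10FFFF.
    return 0;
  }

  public static void main(String[] args) {
    int[] samples = {0x41, 0x80, 0xC0, 0xC1, 0xC2, 0xE2, 0xF0, 0xF5};
    for (int s : samples) {
      System.out.printf("0x%02X -> %d%n", s, leadByteLength((byte) s));
    }
  }
}
```

A purely bit-pattern-based table would report 2 for 0xC0 and 0xC1, since both match 110xxxxx; the range checks above are what exclude them.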


---

