Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20796#discussion_r175573949
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
public Object getBaseObject() { return base; }
public long getBaseOffset() { return offset; }
- private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
- 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
- 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
- 4, 4, 4, 4, 4, 4, 4, 4,
- 5, 5, 5, 5,
- 6, 6};
+ /**
+ * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which
+ * indicates the size of the char. See Unicode standard in page 126:
+ * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+ *
+ * Binary Hex Comments
+ * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding
+ * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes)
+ * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding
--- End diff --
Actually, this table is from the Unicode Standard (10.0, Table 3-6, page 126):
<img width="571" alt="screen shot 2018-03-19 at 9 23 14 pm"
src="https://user-images.githubusercontent.com/1580697/37620318-2c173f7e-2bbc-11e8-8fa5-5f04d11925de.png">
0xC0 and 0xC1 are first bytes of 2-byte chars that UTF-8 disallows (for now).
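
To make the point about 0xC0 and 0xC1 concrete, here is a minimal sketch of a first-byte lookup that rejects them. It is not the code from this pull request: the class name `Utf8LeadByte`, the method `numBytesForFirstByte`, and the choice to return 0 for a byte that cannot start a well-formed sequence are all assumptions made for illustration. The ranges follow the well-formed first-byte ranges of UTF-8 (0x00..0x7F, 0xC2..0xDF, 0xE0..0xEF, 0xF0..0xF4).

```java
// Hypothetical helper for illustration only; not part of UTF8String.
final class Utf8LeadByte {
  private Utf8LeadByte() {}

  /**
   * Returns the length in bytes of the UTF-8 sequence introduced by firstByte,
   * or 0 (an assumed convention) if firstByte cannot start a well-formed sequence.
   */
  static int numBytesForFirstByte(byte firstByte) {
    int b = firstByte & 0xFF; // treat the signed Java byte as 0..255
    if (b <= 0x7F) return 1;  // 0xxxxxxx: only byte of a 1-byte char
    if (b <= 0xBF) return 0;  // 10xxxxxx: continuation byte, never a first byte
    if (b <= 0xC1) return 0;  // 0xC0, 0xC1: could only produce overlong 2-byte forms
    if (b <= 0xDF) return 2;  // 0xC2..0xDF: first byte of a 2-byte char
    if (b <= 0xEF) return 3;  // 0xE0..0xEF: first byte of a 3-byte char
    if (b <= 0xF4) return 4;  // 0xF0..0xF4: first byte of a 4-byte char
    return 0;                 // 0xF5..0xFF: would encode values above U+10FFFF
  }
}
```

With this sketch, `Utf8LeadByte.numBytesForFirstByte((byte) 0xC0)` yields 0 while `(byte) 0xC2` yields 2, so the 0xC0..0xDF row of the table is narrowed to 0xC2..0xDF for well-formed input.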
---