electrum opened a new issue #293: Truncate transform on strings with Unicode characters
URL: https://github.com/apache/incubator-iceberg/issues/293

The specification for truncate says

> *Substring of length `L`*

but does not define what unit `L` counts. I assume the intention is Unicode code points, since the specification says that

> Character strings must be stored as UTF-8 encoded byte arrays

However, the Java reference implementation uses `java.lang.CharSequence#subSequence`, so the length is measured in 16-bit UTF-16 code units. This differs from the code point count for characters outside the Basic Multilingual Plane (BMP): each such code point requires two `char` values, encoded as a high and low surrogate pair. Additionally, the truncation may fall in the middle of a surrogate pair, which is a form of corruption.
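
To illustrate the difference, here is a minimal sketch that contrasts truncation by UTF-16 code units with truncation by code points. The class and method names (`UnicodeTruncate`, `truncateCodeUnits`, `truncateCodePoints`) are hypothetical and are not taken from the Iceberg code base; only standard `java.lang.String` methods are used, and the code-point variant is one possible fix under the assumption that `L` is meant to count code points.

```java
// Hypothetical helper for illustration only; not the Iceberg implementation.
final class UnicodeTruncate {
    private UnicodeTruncate() {}

    // Truncates by UTF-16 code units, which is what CharSequence#subSequence counts.
    // This can cut a surrogate pair in half.
    static String truncateCodeUnits(String value, int length) {
        return value.subSequence(0, Math.min(length, value.length())).toString();
    }

    // Truncates by Unicode code points, so a surrogate pair is never split.
    static String truncateCodePoints(String value, int length) {
        int codePoints = value.codePointCount(0, value.length());
        if (codePoints <= length) {
            return value;
        }
        // Index of the char that starts the code point at position `length`.
        int end = value.offsetByCodePoints(0, length);
        return value.substring(0, end);
    }
}
```

For the input `"a\uD83D\uDE00b"` (the letter `a`, U+1F600, the letter `b`) and `L = 2`, `truncateCodeUnits` returns `a` followed by a lone high surrogate, while `truncateCodePoints` returns `a` followed by the complete surrogate pair.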
