[GitHub] [incubator-iceberg] electrum opened a new issue #293: Truncate transform on strings with Unicode characters

GitBox Tue, 16 Jul 2019 16:56:32 -0700

electrum opened a new issue #293: Truncate transform on strings with Unicode 
characters
URL: https://github.com/apache/incubator-iceberg/issues/293
 
 
   The specification for truncate says
   
   > *Substring of length `L`*
   
   but does not define the unit in which `L` is counted. I assume the intention 
is Unicode code points, since the specification says that
   
   > Character strings must be stored as UTF-8 encoded byte arrays
   
   However, the Java reference implementation uses 
`java.lang.CharSequence#subSequence`, so the length is counted in 16-bit UTF-16 
code units rather than code points. The two counts differ for characters outside 
the Basic Multilingual Plane (BMP): each such code point takes two `char` values, 
a high and a low surrogate. Additionally, the truncation may fall in the middle 
of a surrogate pair, leaving an unpaired surrogate, which is a form of corruption.
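
   A minimal sketch of the difference, using only `java.lang` methods. The 
class and method names are hypothetical and are not taken from the Iceberg 
code base; `truncateCodeUnits` only mirrors the `subSequence` behaviour 
described above, and `truncateCodePoints` shows one possible way to count in 
code points instead.

```java
// Hypothetical helper class for illustration; not part of Iceberg.
final class UnicodeTruncateSketch {
    private UnicodeTruncateSketch() {}

    // Truncates by UTF-16 code units, as CharSequence#subSequence does.
    // May cut a surrogate pair in half.
    public static CharSequence truncateCodeUnits(CharSequence value, int length) {
        return value.subSequence(0, Math.min(value.length(), length));
    }

    // Truncates by Unicode code points, so a surrogate pair is never split.
    public static String truncateCodePoints(String value, int length) {
        // Return the input unchanged when it has at most `length` code points;
        // this also keeps offsetByCodePoints from running past the end.
        if (value.codePointCount(0, value.length()) <= length) {
            return value;
        }
        // Index of the code unit that follows the first `length` code points.
        int end = value.offsetByCodePoints(0, length);
        return value.substring(0, end);
    }
}
```

   For the input `"a\uD83D\uDE00b"` (the letter `a`, U+1F600, the letter `b`) 
and `L = 2`, `truncateCodeUnits` returns `a` followed by a lone high surrogate, 
while `truncateCodePoints` returns `a` followed by the complete U+1F600.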


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

