Re: Unicode normalization problem

Claude Pache Thu, 02 Apr 2015 04:15:44 -0700

> On 2 Apr 2015, at 01:22, Jordan Harband <[email protected]> wrote:
> 
> Unfortunately we don't have a String#codepoints or something that would 
> return the number of code points as opposed to the number of characters (that 
> "length" returns) - something like that imo would greatly simplify explaining 
> the differences to people.
> 
> For the time being, I've been explaining that some characters are actually 
> made up of two, and the 💩 character (it's a fun example to use) is an example 
> of two characters combining to make one "code point". It's not a quick or 
> trivial thing to explain but people do seem to grasp it eventually.


And when they think they have understood, they are in fact still in great 
trouble, because they will confuse it with other, unrelated issues such as 
grapheme clusters and precomposed characters.
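
To illustrate how the precomposed-character issue differs from the encoding 
issue, here is a minimal sketch, assuming an engine that supports the ES2015 
additions String.prototype.normalize and Array.from. The variable names are 
only illustrative.

```javascript
// "é" written two ways; both render as one user-perceived character.
const precomposed = "\u00E9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301"; // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

// Both strings use only code points below U+10000, so no surrogate pairs
// are involved: code units and code points coincide here.
console.log(precomposed.length);            // 1
console.log(decomposed.length);             // 2
console.log(Array.from(decomposed).length); // 2 (code points)

// The two strings differ until they are normalized to the same form.
console.log(precomposed === decomposed);                  // false
console.log(precomposed === decomposed.normalize("NFC")); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```

The decomposed form is two code points and two 16-bit units, yet one grapheme 
cluster, so counting code points does not make this issue go away.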

The issue here is specific to the UTF-16 encoding, where some Unicode code 
points are encoded as a sequence of two 16-bit units; and ES strings are (by an 
accident of history) sequences of 16-bit units, not of Unicode code points. I 
think it is important to stress that this is an issue of encoding, so that 
people at least have a chance to distinguish it from the other issues 
mentioned above.
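
The encoding point can be sketched as follows, assuming an ES2015 engine 
(for the "\u{...}" escape and for string iteration by code point). The helpers 
toSurrogatePair and countCodePoints are hypothetical names introduced for this 
example, not built-ins.

```javascript
// UTF-16 encodes a code point above U+FFFF as two 16-bit units
// (a high surrogate followed by a low surrogate).
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Count code points by iterating the string; the ES2015 string iterator
// yields whole code points rather than individual 16-bit units.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) count++;
  return count;
}

const pileOfPoo = "\u{1F4A9}"; // U+1F4A9 PILE OF POO

console.log(pileOfPoo.length);           // 2 (16-bit units)
console.log(countCodePoints(pileOfPoo)); // 1 (code point)

const [high, low] = toSurrogatePair(0x1F4A9);
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(pileOfPoo === String.fromCharCode(high, low)); // true
```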

(So, taking your example, the 💩 character is internally represented as a 
sequence of two 16-bit units, not “characters”. And, very confusingly, the 
String methods that contain “char” in their name, such as charAt and 
charCodeAt, have nothing to do with “characters”: they operate on 16-bit 
units.)
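
A short sketch of how the “char” methods behave, next to their code point 
counterparts codePointAt and String.fromCodePoint, assuming an ES2015 engine. 
The sample string is made up for this example.

```javascript
const sample = "a\u{1F4A9}b"; // "a", U+1F4A9, "b": four 16-bit units

// charAt and charCodeAt index and return individual 16-bit units.
console.log(sample.length);                     // 4
console.log(sample.charAt(1) === "\uD83D");     // true (lone high surrogate)
console.log(sample.charCodeAt(1).toString(16)); // d83d

// codePointAt decodes a surrogate pair, but its index still counts
// 16-bit units, so index 2 lands on the lone low surrogate.
console.log(sample.codePointAt(1).toString(16)); // 1f4a9
console.log(sample.codePointAt(2).toString(16)); // dca9

// String.fromCharCode truncates its argument to 16 bits;
// String.fromCodePoint produces the full surrogate pair.
console.log(String.fromCharCode(0x1F4A9) === "\uF4A9");      // true
console.log(String.fromCodePoint(0x1F4A9) === "\u{1F4A9}");  // true
```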

—Claude

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
