On Sun, Mar 18, 2018 at 10:43 AM, C. Scott Ananian <[email protected]>
wrote:

> On Sun, Mar 18, 2018, 10:30 AM Anders Rundgren <
> [email protected]> wrote:
>
>> Violently agree but do not understand (I guess I'm just dumb...) why (for
>> example) sorting on UCS2/UTF-16 Code Units would not achieve the same goal
>> (although the result would differ).
>>
>
> Because there are JavaScript strings which do not form valid UTF-16 code
> units.  For example, the one-character string '\uD800'. On the input
> validation side, there are 8-bit strings which can not be decoded as
> UTF-8.  A complete sorting spec needs to describe how these are to be
> handled. For example, something like WTF-8:
> http://simonsapin.github.io/wtf-8/
>

Let's get the terminology straight.
"\uD800" is a valid string of UTF-16 code units. It is also a valid
string of code points. It is not a valid string of scalar values, because
U+D800 is a high-surrogate code point.

http://www.unicode.org/glossary/#code_point : Any value in the Unicode
codespace; that is, the range of integers from 0 to 10FFFF₁₆.
http://www.unicode.org/glossary/#code_unit : The minimal bit combination
that can represent a unit of encoded text for processing or interchange.
http://www.unicode.org/glossary/#unicode_scalar_value : Any Unicode code
point except high-surrogate and low-surrogate code points. In other words,
the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
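
To make the three terms concrete, here is a rough sketch in JavaScript. The
helper names (codeUnits, codePoints, isScalarValue, isScalarValueString) are
made up for illustration and are not part of any spec or library; only
charCodeAt, codePointAt and Array.from are standard.

```javascript
// Hypothetical helpers for illustration only.

// UTF-16 code units: one entry per index of the string.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// Code points: string iteration pairs up surrogates where possible and
// yields a lone surrogate unchanged.
function codePoints(s) {
  return Array.from(s, (ch) => ch.codePointAt(0));
}

// Scalar value: any code point outside the surrogate range D800..DFFF.
function isScalarValue(cp) {
  return cp >= 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// True when every code point in the string is a scalar value.
function isScalarValueString(s) {
  return codePoints(s).every(isScalarValue);
}
```

With these, codeUnits("\uD800") and codePoints("\uD800") both give a
one-element array holding 0xD800, while isScalarValueString("\uD800") is
false. A string such as "\u{1F600}" has two code units but one code point,
and that code point is a scalar value.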
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
