Re: Full Unicode strings strawman

Wes Garland Tue, 17 May 2011 20:01:59 -0700

Mark;

Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder
of the Unicode <http://en.wikipedia.org/wiki/Unicode> project and the
president of the Unicode
Consortium <http://en.wikipedia.org/wiki/Unicode_Consortium> since its
incorporation in 1991?

(If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5,
et al. Those gave me lots of hair loss in the late 90s.)

On 17 May 2011 21:55, Mark Davis ☕ <[email protected]> wrote:

>> In the past, I have read it thus, pseudo BNF:
>>
>> UnicodeString => CodeUnitSequence // D80
>> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
>> CodeUnit => <anything in the current encoding form> // D77
>>
>
> So far, so good. In particular, d800 is a code unit for UTF-16, since it is
> a code unit that can occur in some code unit sequence in UTF-16.
>

*head smack* - code unit, not code point.
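
To make the code unit vs. code point distinction concrete, here is a minimal TypeScript sketch. It assumes strings are sequences of 16-bit code units, as in ECMAScript 5; `codeUnits` and `isSurrogateCodeUnit` are hypothetical helper names, not anything from the spec or this thread.

```typescript
// Hypothetical helpers for illustration only.

// Return the UTF-16 code units of a string as plain numbers.
function codeUnits(s: string): number[] {
  const units: number[] = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// True for any value in the surrogate range 0xD800..0xDFFF.
function isSurrogateCodeUnit(unit: number): boolean {
  return unit >= 0xD800 && unit <= 0xDFFF;
}

// A string holding a single, unpaired high surrogate code unit.
const loneHigh: string = "\uD800";

// U+1D306 is one code point but two UTF-16 code units.
const supplementary: string = "\uD834\uDF06";

console.log(codeUnits(loneHigh));      // [ 55296 ], i.e. 0xD800
console.log(codeUnits(supplementary)); // [ 55348, 57094 ], i.e. 0xD834 0xDF06
```

The lone 0xD800 is a perfectly good code unit sequence of length one, which is the sense in which d800 "can occur" in UTF-16 even though it is not a character on its own.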

>
>
>> This means that your original assertion -- that Unicode strings cannot
>> contain the high surrogate code points, regardless of meaning -- is in fact
>> correct.
>>
>
> That is incorrect.
>

Aie, Karumba!

If we have

   - a sequence of code points
   - taking on values between 0 and 0x1FFFFF
   - including high surrogates and other reserved values
   - independent of encoding

...what exactly are we talking about? Can it be represented in UTF-16
without round-trip loss when normalization is not performed, for the code
points 0 through 0xFFFF?
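
One way to probe that question is to write the round trip down. The TypeScript sketch below is only an illustration under stated assumptions: `encodeUtf16`, `decodeUtf16` and `roundTrips` are hypothetical names, input code points are assumed to be at most 0x10FFFF (larger values have no UTF-16 form), and lone surrogates are passed through unchanged rather than rejected.

```typescript
// Hypothetical helpers for illustration only.

// Encode code points as UTF-16 code units. Assumes each value is <= 0x10FFFF.
// Surrogate code points (0xD800..0xDFFF) are passed through unchanged.
function encodeUtf16(codePoints: number[]): number[] {
  const units: number[] = [];
  for (let i = 0; i < codePoints.length; i++) {
    const cp = codePoints[i];
    if (cp > 0xFFFF) {
      const offset = cp - 0x10000;
      units.push(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    } else {
      units.push(cp);
    }
  }
  return units;
}

// Decode UTF-16 code units to code points. A high surrogate followed by a
// low surrogate becomes one supplementary code point; anything else is kept.
function decodeUtf16(units: number[]): number[] {
  const codePoints: number[] = [];
  for (let i = 0; i < units.length; i++) {
    const hi = units[i];
    const lo = i + 1 < units.length ? units[i + 1] : -1;
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      codePoints.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++; // the low surrogate has been consumed
    } else {
      codePoints.push(hi);
    }
  }
  return codePoints;
}

// True when encoding and then decoding yields the original sequence.
function roundTrips(codePoints: number[]): boolean {
  const back = decodeUtf16(encodeUtf16(codePoints));
  if (back.length !== codePoints.length) return false;
  for (let i = 0; i < back.length; i++) {
    if (back[i] !== codePoints[i]) return false;
  }
  return true;
}

console.log(roundTrips([0x41, 0xD800, 0x42])); // true: a lone surrogate survives
console.log(roundTrips([0xD834, 0xDF06]));     // false: comes back as [0x1D306]
```

Under these assumptions, the only sequences of values 0 through 0xFFFF that fail to round-trip are those where a high surrogate code point is immediately followed by a low surrogate code point, because UTF-16 cannot tell that apart from a single supplementary code point.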

Incidentally, I think this discussion underscores nicely why I think we
should work hard to figure out a way to hide UTF-16 encoding details from
user-end programmers.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
