> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Rick McGowan
> The following public review issues are new:
>
> 25 Proposed Update UTR #17 Character Encoding Model 2004.01.27

I have submitted the following comments, copied here in case anyone wishes to discuss them:

The draft text for UTR #17, section 5, says, "A simple character encoding scheme is a mapping of each code unit of a CCS into a unique serialized byte sequence." It goes on to define a compound CES. While not stated explicitly, Unicode's CESs do not fit the definition of a compound CES, and so the definition for a simple CES must apply.

The problem is that this definition cannot accommodate all seven Unicode CESs. Since it defines a CES as a mapping from each code unit, there are only two possible byte-order-dependent mappings for 16- and 32-bit code units. In other words, the distinction between UTF-16BE and UTF-16 data that happens to be big-endian cannot be a CES distinction, because individual code units are mapped in exactly the same way in both cases. A definition for a simple CES must, at a minimum, refer to a mapping of *streams* of code units if it is to include details about a byte order mark that may or may not occur at the beginning of a stream.

I would suggest that, in order to accommodate the UTF-16 and UTF-32 CESs, an appropriate definition should actually be a level of abstraction away from "a mapping": a CES is a *specification* for mappings. Any mapping is necessarily deterministic, giving a specific output for each input. A mapping itself cannot serialize "in either big-endian or little-endian format"; it must be one or the other, unambiguously. On the other hand, a specification for how to map code units into byte sequences can be ambiguous in this regard. Thus, the UTF-16 CES can be considered a specification for mappings into byte sequences that allows either a little-endian mapping or a big-endian mapping.

Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division
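[A concrete illustration of the point above, not part of the original message: in Python's codecs, "utf-16-be" and "utf-16-le" are each a single deterministic code-unit-to-byte mapping, while "utf-16" (the UTF-16 CES) is a stream-level scheme that adds a BOM and permits either byte order.]

```python
s = "A\u00e9"  # code units U+0041, U+00E9

be = s.encode("utf-16-be")  # b'\x00A\x00\xe9' -- unambiguous mapping, no BOM
le = s.encode("utf-16-le")  # b'A\x00\xe9\x00' -- unambiguous mapping, no BOM

# Big-endian data under the UTF-16 CES: a BOM at the start of the stream,
# then exactly the same code-unit-to-byte mapping as UTF-16BE.
ces_be = b"\xfe\xff" + be
assert ces_be.decode("utf-16") == s
assert ces_be[2:] == be  # the per-code-unit mapping is identical;
                         # only the stream-initial BOM distinguishes the two
```

This is why the distinction has to live at the level of streams (and of a specification admitting two mappings), not at the level of individual code units.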

