> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Rick McGowan
> The following public review issues are new:
>
> 25 Proposed Update UTR #17 Character Encoding Model 2004.01.27

I have submitted the following comments, copied here in case anyone wishes to discuss them:

The draft text for UTR #17, section 5, says, "A simple character encoding scheme is a mapping of each code unit of a CCS into a unique serialized byte sequence." It goes on to define a compound CES. While not stated explicitly, Unicode's CESs do not fit the definition of a compound CES, and so the definition for a simple CES must apply.

The problem is that this definition cannot accommodate all seven Unicode CESs. Since it defines a CES as a mapping from each code unit, there are only two possible byte-order-dependent mappings for 16- and 32-bit code units. In other words, the distinction between UTF-16BE and UTF-16 data that happens to be big-endian cannot be a CES distinction, because individual code units are mapped in exactly the same way in both cases. A definition for a simple CES must, at a minimum, refer to a mapping of *streams* of code units if it is to include details about a byte order mark that may or may not occur at the beginning of a stream.

I would suggest that, in order to accommodate the UTF-16 and UTF-32 CESs, an appropriate definition should actually be a level of abstraction away from "a mapping": a CES is a *specification* for mappings. Any mapping is necessarily deterministic, giving a specific output for each input. A mapping itself cannot serialize "in either big-endian or little-endian format"; it must be one or the other, unambiguously. On the other hand, a specification for how to map code units into byte sequences can be ambiguous in this regard. Thus, the UTF-16 CES can be considered a specification for mappings into byte sequences that allows either a little-endian mapping or a big-endian mapping.

Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division
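[A concrete illustration of the point above, not part of the original message: in Python's codecs, "utf-16-be" and "utf-16-le" are each a single deterministic code-unit-to-byte mapping, while "utf-16" (the UTF-16 CES) is a stream-level scheme that adds a BOM and permits either byte order.]

```python
s = "A\u00e9"  # code units U+0041, U+00E9

be = s.encode("utf-16-be")  # b'\x00A\x00\xe9' -- unambiguous mapping, no BOM
le = s.encode("utf-16-le")  # b'A\x00\xe9\x00' -- unambiguous mapping, no BOM

# Big-endian data under the UTF-16 CES: a BOM at the start of the stream,
# then exactly the same code-unit-to-byte mapping as UTF-16BE.
ces_be = b"\xfe\xff" + be
assert ces_be.decode("utf-16") == s
assert ces_be[2:] == be  # the per-code-unit mapping is identical;
                         # only the stream-initial BOM distinguishes the two
```

This is why the distinction has to live at the level of streams (and of a specification admitting two mappings), not at the level of individual code units.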

