Re: UTF-8 in RFC 2279 and ISO 10646

Markus Kuhn Tue, 01 May 2001 05:38:32 -0700
Florian Weimer wrote on 2001-05-01 12:48 UTC:
> Sorry for this question which is slightly off topic:
> 
> Are the UTF-8 definitions in ISO/IEC 10646-1:2000 and RFC 2279
> identical or equivalent?

The differences are rather subtle. For instance, ISO/IEC 10646-1:2000
makes it clear that the UTF-8 sequences of U+D800 .. U+DFFF, U+FFFE and
U+FFFF are not allowed to occur in a UTF-8 stream, whereas RFC 2279
doesn't mention that. RFC 2279 on the other hand warns of the risk of
UTF-8 decoders accepting overlong sequences, which ISO 10646 does not
mention explicitly. ISO 10646-1 specifies ISO 2022 ESC sequences for
UTF-8, whereas the other standards don't. Unicode 3.0 required UTF-8
decoders to decode overlong sequences, whereas Unicode 3.1 requires them
to be treated as malformed sequences. Etc.
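
To make the restrictions above concrete, here is a minimal sketch of a
strict decoder in Python. It is an illustration only, not text from any
of the standards: the function name decode_utf8_strict and the
reject_noncharacters flag are made up for this example, and the 5- and
6-byte forms that RFC 2279 also defines are left out for brevity. It
rejects overlong sequences, the surrogate range U+D800 .. U+DFFF and,
optionally, U+FFFE and U+FFFF.

```python
def decode_utf8_strict(data: bytes, reject_noncharacters: bool = True) -> list:
    """Decode UTF-8 bytes into a list of code points (hypothetical helper).

    Raises ValueError on malformed input. Only 1- to 4-byte sequences
    are handled in this sketch.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # Pick payload bits, number of continuation bytes, and the
        # smallest code point that legitimately needs this length.
        if b < 0x80:
            cp, need, minimum = b, 0, 0x0
        elif 0xC0 <= b < 0xE0:
            cp, need, minimum = b & 0x1F, 1, 0x80
        elif 0xE0 <= b < 0xF0:
            cp, need, minimum = b & 0x0F, 2, 0x800
        elif 0xF0 <= b < 0xF8:
            cp, need, minimum = b & 0x07, 3, 0x10000
        else:
            # Stray continuation byte, or a lead byte this sketch does not support.
            raise ValueError(f"invalid lead byte 0x{b:02X} at offset {i}")

        if i + need >= n and need > 0:
            raise ValueError(f"truncated sequence at offset {i}")

        for k in range(1, need + 1):
            c = data[i + k]
            if c & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)

        # Overlong: the value would have fit into a shorter sequence.
        if cp < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        # Surrogates are reserved for UTF-16 and must not appear in UTF-8.
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if reject_noncharacters and cp in (0xFFFE, 0xFFFF):
            raise ValueError(f"U+{cp:04X} not allowed at offset {i}")

        out.append(cp)
        i += need + 1
    return out
```

For example, decode_utf8_strict(b"\xe2\x82\xac") returns [0x20AC], while
the overlong encoding b"\xc0\xaf" of "/" and the surrogate encoding
b"\xed\xa0\x80" both raise ValueError.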

> Can any harm result if a normative document
> refers to both definitions (this is a bad idea if the definitions are
> slightly different).

I'd see RFC 2279 more as the official MIME registration of the UTF-8
encoding as defined in ISO/IEC 10646-1. They are intended to be the same
thing, so I would reference RFC 2279 probably only in the context of
using the MIME charset namespace.

> And BTW: Does ISO 10646 define character properties (such as lowercase
> letter, uppercase letter, titlecase letter, other letter, decimal
> digit, other digit and so on)?

No, only Unicode does that.

Why don't you get a copy of ISO/IEC 10646-1:2000 yourself? With just
80 CHF for the PDF CD-ROM, it is the ISO standard with the lowest
per-page-price ever seen.

http://www.iso.ch/cate/d29819.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/