Florian Weimer wrote on 2001-05-01 12:48 UTC:
> Sorry for this question which is slightly off topic:
>
> Are the UTF-8 definitions in ISO/IEC 10646-1:2000 and RFC 2279
> identical or equivalent?
The differences are rather subtle. For instance, ISO/IEC 10646-1:2000
makes it clear that the UTF-8 sequences of U+D800 .. U+DFFF, U+FFFE and
U+FFFF are not allowed to occur in a UTF-8 stream, whereas RFC 2279
doesn't mention that. RFC 2279 on the other hand warns of the risk of
UTF-8 decoders accepting overlong sequences, which ISO 10646 does not
mention explicitly. ISO 10646-1 specifies ISO 2022 ESC sequences for
UTF-8, whereas the other standards don't. Unicode 3.0 permitted UTF-8
decoders to accept overlong sequences, whereas Unicode 3.1 requires them
to be treated as malformed sequences. Etc.
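
To make the differences above concrete, here is a minimal sketch of a
decoder that applies the stricter reading: it rejects overlong
sequences (the RFC 2279 warning, mandatory as of Unicode 3.1) as well as
U+D800 .. U+DFFF, U+FFFE and U+FFFF (the ISO/IEC 10646-1:2000 rule). The
function names decode_utf8_strict and is_well_formed are made up for
this illustration and do not come from any of the standards. The sketch
assumes the original forms of up to six bytes described in RFC 2279.

```python
# Hypothetical illustration, not taken from any standard's text.
# Each entry: (lead-byte mask, lead-byte pattern,
#              number of continuation bytes, smallest code point allowed)
FORMS = [
    (0x80, 0x00, 0, 0x0000),
    (0xE0, 0xC0, 1, 0x0080),
    (0xF0, 0xE0, 2, 0x0800),
    (0xF8, 0xF0, 3, 0x10000),
    (0xFC, 0xF8, 4, 0x200000),
    (0xFE, 0xFC, 5, 0x4000000),
]


def decode_utf8_strict(data: bytes) -> list[int]:
    """Return the code points in data, raising ValueError on any
    malformed, overlong, surrogate, U+FFFE or U+FFFF sequence."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, extra, minimum in FORMS:
            if (lead & mask) == pattern:
                break
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")

        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")

        # Payload bits of the lead byte, then 6 bits per continuation byte.
        cp = lead & (~mask & 0xFF)
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)

        if cp < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError(f"U+{cp:04X} not allowed at offset {i}")

        out.append(cp)
        i += 1 + extra
    return out


def is_well_formed(data: bytes) -> bool:
    """Convenience wrapper: True if decode_utf8_strict accepts data."""
    try:
        decode_utf8_strict(data)
    except ValueError:
        return False
    return True
```

For example, is_well_formed(b"\xc0\x80") is False because C0 80 is an
overlong encoding of U+0000, and is_well_formed(b"\xed\xa0\x80") is
False because it would decode to U+D800. A decoder that follows only
the letter of RFC 2279 could accept the second case.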
> Can any harm result if a normative document
> refers to both definitions (this is a bad idea if the definitions are
> slightly different).
I'd see RFC 2279 more as the official MIME registration of the UTF-8
encoding as defined in ISO/IEC 10646-1. They are intended to be the same
thing, so I would reference RFC 2279 probably only in the context of
using the MIME charset namespace.
> And BTW: Does ISO 10646 define character properties (such as lowercase
> letter, uppercase letter, titlecase letter, other letter, decimal
> digit, other digit and so on)?
No, only Unicode does that.
Why don't you get a copy of ISO/IEC 10646-1:2000 yourself? At just
80 CHF for the PDF CD-ROM, it is the ISO standard with the lowest
per-page price ever seen.
http://www.iso.ch/cate/d29819.html
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/