"John S. Yates, Jr." <[EMAIL PROTECTED]> writes:
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8. There is no MEANINGFUL definition
> of BOM in a UTF-8 string. But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.
The Unicode standard is at fault too.
It specifies UTF-16 and UTF-32 in two kinds of variants:
- UTF-{16,32} with an optional BOM (defaulting to big endian if the
BOM is not present), where the BOM is mandatory if the first
character of the contents is U+FEFF (otherwise it would be mistaken
for a BOM).
- UTF-{16,32}{LE,BE} with a fixed endianness and without a BOM;
a leading U+FEFF in, say, UTF-16BE must not be interpreted as a BOM;
it is always part of the text.
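
To make the difference concrete, here is a small sketch using Python's
stock codecs. I'm assuming that "utf-16" behaves like the BOM-aware
variant and "utf-16-be" like the fixed-endianness one; the variable
names are only for illustration.

```python
# U+FEFF followed by "A", serialized big-endian.
raw = b"\xfe\xff\x00A"

# BOM-aware variant: the leading U+FEFF selects the byte order
# and is removed from the decoded text.
with_bom = raw.decode("utf-16")

# Fixed-endianness variant: there is no BOM concept, so the same
# two bytes decode to a U+FEFF character that is part of the text.
fixed = raw.decode("utf-16-be")

# Text that really starts with U+FEFF survives the BOM-aware variant
# only because the encoder writes a BOM in front of it.
roundtrip = "\ufeffA".encode("utf-16").decode("utf-16")

print(repr(with_bom))
print(repr(fixed))
print(repr(roundtrip))
```

One caveat: as far as I know, Python's "utf-16" decoder falls back to
the platform's native byte order when no BOM is present, not to big
endian, so the sketch deliberately avoids BOM-less input.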
The problem is that the case of UTF-8 is not clear. Formally it
doesn't have a BOM, but the standard includes some ambiguous wording
to the effect that various software uses a UTF-8 BOM and that its
presence should not affect the interpretation. The standard should
clearly distinguish two interpretations of UTF-8: one without the
concept of a BOM, and one which permits the BOM (and in fact makes it
mandatory if the stream begins with U+FEFF).
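
Python's codecs happen to offer roughly this split already, so here is
a sketch of what the two interpretations look like in practice. I'm
assuming "utf-8" stands in for the BOM-less reading and "utf-8-sig"
for the BOM-permitting one; the sample bytes are made up.

```python
# U+FEFF encoded in UTF-8 (EF BB BF), followed by "hello".
data = b"\xef\xbb\xbfhello"

# Interpretation without a BOM concept: U+FEFF is ordinary text.
plain = data.decode("utf-8")

# Interpretation that permits a BOM: a leading U+FEFF is a signature
# and is dropped from the decoded text.
sig = data.decode("utf-8-sig")

# Under the BOM-permitting interpretation, text that really begins
# with U+FEFF needs a BOM in front of it to survive a round trip.
encoded = "\ufeffhello".encode("utf-8-sig")
roundtrip = encoded.decode("utf-8-sig")

print(repr(plain))
print(repr(sig))
print(encoded)
print(repr(roundtrip))
```

The same bytes yield two different strings depending on which
interpretation the reader picks, which is why the choice has to be
stated rather than guessed.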
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000