Re: [Python-3000] Pre-PEP: Easy Text File Decoding

Marcin 'Qrczak' Kowalczyk Wed, 13 Sep 2006 06:38:25 -0700

"John S. Yates, Jr." <[EMAIL PROTECTED]> writes:

> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.  But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.


The Unicode standard is at fault too.

It specifies UTF-16 and UTF-32 in two kinds of variants:

- UTF-{16,32} with an optional BOM (defaulting to big endian if the
  BOM is not present), where the BOM is mandatory if the first
  character of the contents is U+FEFF (otherwise that character would
  be mistaken for a BOM).

- UTF-{16,32}{LE,BE} with a fixed endianness and without a BOM;
  a leading U+FEFF in, say, UTF-16BE must not be interpreted as
  a BOM; it is always a part of the text.
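
To illustrate the difference between the two kinds of variants, here is
a minimal sketch using the codecs shipped with present-day CPython
('utf-16' and 'utf-16-be'). The variable names are made up for the
example. Only inputs that carry a BOM are shown, because for BOM-less
input CPython's 'utf-16' decoder falls back to the platform's native
byte order instead of the big-endian default described above.

```python
BOM_BE = b"\xfe\xff"  # U+FEFF serialized in big-endian byte order

# The same four bytes, FE FF 00 41, read under the two interpretations.
data = BOM_BE + "A".encode("utf-16-be")

# "UTF-16 with optional BOM": the leading U+FEFF is consumed as a BOM.
with_bom_variant = data.decode("utf-16")

# "UTF-16BE": fixed endianness, so the leading U+FEFF is part of the text.
fixed_variant = data.decode("utf-16-be")

# Text that really begins with U+FEFF needs an explicit BOM in front of it
# under the first interpretation; only the first U+FEFF is stripped.
text = "\ufeffA"
round_trip = (BOM_BE + text.encode("utf-16-be")).decode("utf-16")

print(repr(with_bom_variant), repr(fixed_variant), repr(round_trip))
```

The same bytes decode to different strings depending on which variant
the reader assumes, which is why the label has to say which one is meant.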

The problem is that the case of UTF-8 is not clear. Formally UTF-8
doesn't have a BOM, but the standard includes some ambiguous wording
to the effect that various software uses a UTF-8 BOM and that the
presence of a BOM should not affect the interpretation. The standard
should clearly distinguish two interpretations of UTF-8: one without
the concept of a BOM, and one which permits the BOM (and in fact makes
it mandatory if the stream begins with U+FEFF).
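
As a rough illustration of what such a split looks like in practice,
CPython happens to ship two codecs that behave like the two
interpretations: 'utf-8' (no concept of a BOM) and 'utf-8-sig' (BOM
permitted on input, written on output). This is only a sketch of the
distinction, not something the Unicode standard itself defines; the
variable names are invented for the example.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8

data = UTF8_BOM + b"hi"

# Interpretation 1: no BOM concept, the leading U+FEFF stays in the text.
plain = data.decode("utf-8")

# Interpretation 2: a leading BOM is permitted and stripped if present.
sig = data.decode("utf-8-sig")
sig_without_bom = b"hi".decode("utf-8-sig")

# Under interpretation 2 the BOM is effectively mandatory when the text
# itself starts with U+FEFF: the encoder prepends one, and the decoder
# strips only that first occurrence.
text = "\ufeffhi"
encoded = text.encode("utf-8-sig")
round_trip = encoded.decode("utf-8-sig")

print(repr(plain), repr(sig), repr(sig_without_bom), repr(round_trip))
```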

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
