Re: [Python-3000] Pre-PEP: Easy Text File Decoding

Walter Dörwald Wed, 13 Sep 2006 15:05:52 -0700

Jason Orendorff wrote:
> On 9/13/06, John S. Yates, Jr. <[EMAIL PROTECTED]> wrote:
>> It is a mistake on Microsoft's part to fail to strip the BOM
>> during conversion to UTF-8.
> 
> John, you're mistaken about the reason this BOM is here.
> 
> In Notepad at least, the BOM is intentionally generated when writing
> the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
> admit the BOM was not originally invented for this purpose.)


In theory it's only metadata if external information says that it is. In
practice it's unlikely that a charmap-encoded file begins with these
three bytes. Nevertheless, it's only a hint.
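
A rough sketch of treating the BOM as a hint rather than as proof. The helper name guess_encoding and the latin-1 fallback are assumptions made for illustration, not something proposed in this thread:

```python
import codecs

def guess_encoding(data, fallback="latin-1"):
    """Guess an encoding for raw bytes, using a leading UTF-8 BOM as a hint.

    Hypothetical helper: the fallback encoding is an arbitrary choice here.
    """
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return fallback

# A charmap-encoded file could legitimately start with the same three
# bytes (in latin-1 they decode to three printable characters), which is
# why the result is only a guess.
print(guess_encoding(b"\xef\xbb\xbfhello"))
print(guess_encoding(b"hello"))
```

External information about the encoding, where it exists, should still override this guess.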

>> There is no MEANINGFUL definition of BOM in a UTF-8
>> string.
> 
> This thread is about files, not strings.  At the start of a file, a
> UTF-8 BOM is meaningful.  It means the file is UTF-8.

... and the first "character" in the file is U+FEFF. If you want the 
codec to drop the BOM on reading, use the UTF-8-Sig codec.
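
A small sketch of the difference between the two codecs, written in Python 3 syntax with made-up sample bytes:

```python
# Sample data: a UTF-8 BOM followed by ASCII text (illustrative only).
data = b"\xef\xbb\xbfhello"

# The plain UTF-8 codec keeps the BOM as the character U+FEFF.
plain = data.decode("utf-8")

# The utf-8-sig codec drops a leading BOM on decoding.
stripped = data.decode("utf-8-sig")

# On encoding, utf-8-sig writes the BOM first.
encoded = "hello".encode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
print(repr(encoded))
```

Decoding with utf-8-sig also accepts input that has no BOM, so it can be used for reading either kind of file.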

> [...]

Servus,
    Walter
