On Jan 7, 2010, at 11:21 PM, Guido van Rossum wrote:

> On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz <gl...@twistedmatrix.com> wrote:
>> 
>> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>>> 
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>>> talk. And for the other two, perhaps it would make more sense to have
>>> a separate encoding-guessing function that takes a binary stream and
>>> returns a text stream wrapping it with the proper encoding?
>> 
>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>> description of the issues:
>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>.  Basically, some
>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>> being UTF-8, so it's become a convention to do that.  That's not good
>> enough, so you need to guess the encoding as well to make sure, but if there
>> is a BOM and you can otherwise verify that the file is probably UTF-8
>> encoded, you should discard it.
> 
> That doesn't make sense. If the file isn't UTF-8 you can't see the
> BOM, because the BOM itself is UTF-8-encoded.

I'm saying that the BOM by itself isn't enough to detect that the file is
actually UTF-8.  If (for whatever reason: explicitly specified, guessed in some
other way) the file's encoding is determined to be something else, the bytes
comprising the BOM should be decoded as normal.  It's just that the UTF-8
decoding of the BOM at the start of a file should be "" (the empty string).
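
As a rough illustration of the distinction above, here is a minimal sketch. It assumes Python's standard "utf-8-sig" codec as a stand-in for the behaviour being described (that codec is not named in this message), and the variable names are made up for the example. The same leading bytes vanish under a BOM-aware UTF-8 decode but come through as ordinary characters when the encoding is taken to be something else:

```python
import codecs

# Bytes as a BOM-emitting Windows API might write them: EF BB BF + payload.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# BOM-aware UTF-8 decoding (assumption: the "utf-8-sig" codec):
# the leading BOM decodes to nothing.
as_utf8_sig = data.decode("utf-8-sig")

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
as_utf8 = data.decode("utf-8")

# If the file is determined to be Latin-1 instead, the same three bytes
# are decoded as normal, giving three ordinary characters.
as_latin1 = data.decode("latin-1")

print(repr(as_utf8_sig))
print(repr(as_utf8))
print(repr(as_latin1))
```

Note that nothing here guesses the encoding; it only shows what each choice does with the BOM bytes once the encoding has been decided.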

> (And yes, I know this happens. Doesn't mean we need to auto-guess by
> default; there are lots of issues e.g. what should happen after
> seeking to offset 0?)

I think it's pretty clear that the BOM should still be skipped in that case ...
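
For what it's worth, a small sketch of the seek case, again assuming the "utf-8-sig" codec and an in-memory stream built with io.BytesIO and io.TextIOWrapper (neither is mentioned in this thread; they are just a convenient way to try it). Reading, seeking back to offset 0, and reading again is expected to skip the BOM both times:

```python
import codecs
import io

# In-memory "file" that starts with a UTF-8 BOM.
raw = io.BytesIO(codecs.BOM_UTF8 + "hello".encode("utf-8"))

# Text layer using the BOM-aware codec (assumption: "utf-8-sig").
text = io.TextIOWrapper(raw, encoding="utf-8-sig")

first = text.read()    # BOM skipped on the initial read

text.seek(0)           # back to the very start of the stream
second = text.read()   # decoder is reset, so the BOM is skipped again

print(repr(first), repr(second))
```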

