Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

Guido van Rossum Thu, 07 Jan 2010 20:25:09 -0800

On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz <[email protected]> wrote:
>
>
> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>
> > On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Builtin open() function is unable to open an UTF-16/32 file starting with a
> > > BOM if the encoding is not specified (raise an unicode error). For an UTF-8
> > > file starting with a BOM, read()/readline() returns also the BOM whereas the
> > > BOM should be "ignored".
> >
> > I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
> > talk. And for the other two, perhaps it would make more sense to have
> > a separate encoding-guessing function that takes a binary stream and
> > returns a text stream wrapping it with the proper encoding?
>
> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
> description of the issues:
> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>.  Basically, some
> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
> being UTF-8, so it's become a convention to do that.  That's not good
> enough, so you need to guess the encoding as well to make sure, but if there
> is a BOM and you can otherwise verify that the file is probably UTF-8
> encoded, you should discard it.


That doesn't make sense. If the file isn't UTF-8 you can't see the
BOM, because the BOM itself is UTF-8-encoded.
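
As a small illustration of the point above, the sketch below shows that the
UTF-8 "BOM" is simply U+FEFF encoded as UTF-8, and how the standard "utf-8"
and "utf-8-sig" codecs treat it. The sample payload "text" and the variable
names are made up for the example.

```python
import codecs

# The UTF-8 "BOM" is just U+FEFF encoded as UTF-8: the bytes EF BB BF.
bom = "\ufeff".encode("utf-8")

# Illustrative payload: a BOM followed by some UTF-8 text.
data = codecs.BOM_UTF8 + "text".encode("utf-8")

# Plain "utf-8" keeps U+FEFF as the first character of the result.
with_bom = data.decode("utf-8")

# "utf-8-sig" drops a leading BOM if one is present.
without_bom = data.decode("utf-8-sig")

# Read as another 8-bit encoding, the same three bytes are ordinary
# characters, not a BOM.
as_latin1 = data.decode("latin-1")
```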

(And yes, I know this happens. Doesn't mean we need to auto-guess by
default; there are lots of issues e.g. what should happen after
seeking to offset 0?)
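
For concreteness, the separate encoding-guessing helper suggested earlier in
the thread might look roughly like the sketch below. The name
open_guessing_bom, the default of UTF-8, and the requirement that the binary
stream be seekable are all assumptions made for illustration; nothing like
this is being proposed as a stdlib API here.

```python
import codecs
import io

# Check longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_guessing_bom(binary_stream, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream as text.

    The encoding is chosen from a leading BOM, falling back to `default`.
    """
    head = binary_stream.read(4)
    binary_stream.seek(0)  # rewind so the codec sees (and consumes) the BOM
    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(binary_stream, encoding=encoding)
```

Usage would be along the lines of open_guessing_bom(open(path, "rb")).read().
The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM themselves, so
the helper only has to pick a codec name. It does not settle the seeking
question raised above, and a BOM check alone cannot tell a UTF-32-LE file
from a UTF-16-LE file whose first character is U+0000.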

-- 
--Guido van Rossum (python.org/~guido)