Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

MRAB Fri, 08 Jan 2010 08:49:37 -0800

Victor Stinner wrote:

On Friday 08 January 2010 05:21:04, Guido van Rossum wrote:
(...)
(And yes, I know this happens. Doesn't mean we need to auto-guess by
default; there are lots of issues e.g. what should happen after
seeking to offset 0?)

I wrote a new version of my patch (version 3):
 * don't change the default behaviour: use open(filename, encoding="BOM") to check the BOM if there is any
 * fix for seek(0): always ignore the BOM
 * add a unit test: check that the right encoding is detected, but also that the BOM is ignored (especially after a seek(0))
BOM encoding doesn't work for writing into a file, so open(filename, "w", encoding="BOM") raises a ValueError.

I think it's similar to universal newline mode. You can tell it that
you're reading UTF-something-encoded text (common forms only).

The preference is UTF-8, but it could be UTF-8-sig (from Windows), or
possibly UTF-16/32, which really need a BOM because each code unit spans
multiple bytes, so the bytes could be big-endian or little-endian.

The BOM (or signature) tells you what the encoding is, defaulting to
UTF-8 if there's none. If decoding subsequently raises a
UnicodeDecodeError, then so be it!
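
Roughly, the rule could be sketched like this. It uses only the BOM
constants from the standard codecs module; the name sniff_bom_encoding
is made up for illustration and is not taken from Victor's patch.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def sniff_bom_encoding(head, default="utf-8"):
    """Return a codec name for the first bytes of a file (hypothetical helper).

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") all consume the
    BOM themselves when decoding. Without a BOM, fall back to `default`.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

Passing the first four bytes of the file is enough, since the longest
BOM is four bytes. A file without a BOM that is not valid UTF-8 will
still fail later with UnicodeDecodeError, as described above.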

Maybe there should also be a way of determining what encoding it decided
it was, so that you can then write a new file in that same encoding.
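
As an illustration of what I mean, here is a small sketch in which the
decision is simply exposed through the file object's existing
.encoding attribute. The helper name open_bom is hypothetical (it is
not the encoding="BOM" spelling from the patch), and it is read-only.

```python
import codecs

def open_bom(filename, default="utf-8"):
    """Hypothetical helper: open a text file for reading, choosing the
    codec from the BOM and falling back to `default` if there is none."""
    with open(filename, "rb") as raw:
        head = raw.read(4)  # the longest BOM is 4 bytes
    # Check UTF-32 first: its little-endian BOM starts with the UTF-16 one.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    else:
        encoding = default
    # These codecs strip the BOM while decoding, so the caller never sees it.
    return open(filename, encoding=encoding)

def copy_same_encoding(src_name, dst_name):
    """Rewrite src_name to dst_name using whatever encoding was detected."""
    with open_bom(src_name) as src:
        text = src.read()
        detected = src.encoding  # what the helper decided
    with open(dst_name, "w", encoding=detected) as dst:
        dst.write(text)
    return detected
```

One caveat: writing with "utf-16" or "utf-32" emits a BOM in the
machine's native byte order, so a big-endian input may come out
little-endian. The text round-trips, but the bytes need not be identical.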