Victor Stinner wrote:
On Friday 08 January 2010 05:21:04, Guido van Rossum wrote:
(...)
(And yes, I know this happens. Doesn't mean we need to auto-guess by
default; there are lots of issues e.g. what should happen after
seeking to offset 0?)
I wrote a new version of my patch (version 3):
* don't change the default behaviour: use open(filename, encoding="BOM") to
check the BOM if there is any
* fix for seek(0): always ignore the BOM
* add a unit test: check that the right encoding is detected, but also that the
BOM is ignored (especially after a seek(0))
BOM encoding doesn't work for writing into a file, so open(filename, "w",
encoding="BOM") raises a ValueError.
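
For illustration only, here is a rough sketch of the behaviour described
above, written against the stock open() rather than the patch itself. The
helper names detect_bom_encoding and open_bom are hypothetical. It assumes
the standard "utf-8-sig", "utf-16" and "utf-32" codecs, which skip a leading
BOM while decoding, including after a seek(0).

```python
import codecs

# Check the 4-byte BOMs first: BOM_UTF32_LE begins with the same two
# bytes as BOM_UTF16_LE, so the order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom_encoding(filename, default="utf-8"):
    """Return a codec name chosen from the file's BOM, or `default`."""
    with open(filename, "rb") as f:
        head = f.read(4)
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default


def open_bom(filename, mode="r"):
    """Hypothetical stand-in for open(filename, encoding="BOM")."""
    # There is no BOM to inspect when creating or writing a file.
    if any(flag in mode for flag in "wax+"):
        raise ValueError("BOM detection only works for reading")
    return open(filename, mode, encoding=detect_bom_encoding(filename))
```

A file starting with a UTF-16 BOM would then be read with
open_bom(path).read(), and the text should not contain U+FEFF, even when it is
read again after seek(0).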
I think it's similar to universal newline mode. You can tell it that
you're reading UTF-something-encoded text (common forms only).
The preference is UTF-8, but it could be UTF-8-sig (from Windows), or
possibly UTF-16/32, which really need a BOM because there are multiple
bytes per code unit, so the bytes could be big-endian or little-endian.
The BOM (or signature) tells you what the encoding is, defaulting to
UTF-8 if there's none. If it subsequently raises a UnicodeDecodeError, then
so be it!
Maybe there should also be a way of determining what encoding it decided
it was, so that you can then write a new file in that same encoding.
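
As a rough sketch of that idea (not part of the patch), the following reports
which encoding was chosen and reuses it, together with the original BOM, when
writing a new file. The names sniff_encoding and rewrite_same_encoding are
hypothetical. It assumes explicit-endianness codec names are wanted, so that
the output keeps the same byte order as the input instead of the platform's
native one.

```python
import codecs

# 4-byte BOMs first: BOM_UTF32_LE begins with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(filename):
    """Return (bom, encoding); (b"", "utf-8") when no BOM is present."""
    with open(filename, "rb") as f:
        head = f.read(4)
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return bom, encoding
    return b"", "utf-8"


def rewrite_same_encoding(src, dst, transform):
    """Apply `transform` to the text of `src` and write it to `dst`
    with the same BOM and encoding. Returns the encoding that was used."""
    bom, encoding = sniff_encoding(src)
    with open(src, "rb") as f:
        text = f.read()[len(bom):].decode(encoding)
    with open(dst, "wb") as f:
        f.write(bom + transform(text).encode(encoding))
    return encoding
```

For example, rewrite_same_encoding("in.txt", "out.txt", str.upper) would
return the detected encoding, so the caller can see what was decided.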
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev