On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz <gl...@twistedmatrix.com> wrote:
>
>
> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>
> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
> <victor.stin...@haypocalc.com> wrote:
>
> > Hi,
> >
> > Builtin open() function is unable to open an UTF-16/32 file starting with a
> > BOM if the encoding is not specified (raise an unicode error). For an UTF-8
> > file starting with a BOM, read()/readline() returns also the BOM whereas the
> > BOM should be "ignored".
>
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
> talk. And for the other two, perhaps it would make more sense to have
> a separate encoding-guessing function that takes a binary stream and
> returns a text stream wrapping it with the proper encoding?
>
> It *is* crazy, but unfortunately rather common. Wikipedia has a good
> description of the issues:
> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>. Basically, some
> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
> being UTF-8, so it's become a convention to do that. That's not good
> enough, so you need to guess the encoding as well to make sure, but if there
> is a BOM and you can otherwise verify that the file is probably UTF-8
> encoded, you should discard it.

That doesn't make sense. If the file isn't UTF-8 you can't see the
BOM, because the BOM itself is UTF-8-encoded.

(And yes, I know this happens. Doesn't mean we need to auto-guess by
default; there are lots of issues e.g. what should happen after
seeking to offset 0?)

--
--Guido van Rossum (python.org/~guido)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
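
For illustration only, here is a rough sketch of the "separate
encoding-guessing function" idea quoted above: it takes a binary stream
and returns a text stream wrapping it with an encoding chosen from the
BOM. The name open_guessing_bom and its default parameter are
hypothetical and were not proposed in the thread; the sketch assumes a
seekable stream and only looks at BOMs, with no deeper guessing.

```python
import codecs
import io

# Longest BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
# (A UTF-16 LE file whose first character is U+0000 would be misread as
# UTF-32 here; this sketch accepts that ambiguity.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_guessing_bom(raw, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream in a text stream.

    The codec is picked from a leading BOM if one is present; otherwise
    `default` is used. The BOM is left in the stream and consumed by the
    chosen codec ("utf-8-sig", "utf-16" and "utf-32" all skip it).
    """
    start = raw.tell()
    head = raw.read(4)   # the longest BOM is 4 bytes
    raw.seek(start)      # rewind so the codec sees the BOM itself
    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(raw, encoding=encoding)


if __name__ == "__main__":
    data = codecs.BOM_UTF8 + "spam\n".encode("utf-8")
    text = open_guessing_bom(io.BytesIO(data))
    print(repr(text.read()))
```

Keeping this as an explicit opt-in helper, rather than changing what
open() does by default, matches the hesitation expressed above about
auto-guessing.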