-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz <gl...@twistedmatrix.com>
> wrote:
>>
>> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>>
>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
>> <victor.stin...@haypocalc.com> wrote:
>>
>> Hi,
>>
>> Builtin open() function is unable to open an UTF-16/32 file starting with a
>> BOM if the encoding is not specified (raise an unicode error). For an UTF-8
>> file starting with a BOM, read()/readline() returns also the BOM whereas the
>> BOM should be "ignored".
>>
>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>> talk. And for the other two, perhaps it would make more sense to have
>> a separate encoding-guessing function that takes a binary stream and
>> returns a text stream wrapping it with the proper encoding?
>>
>> It *is* crazy, but unfortunately rather common. Wikipedia has a good
>> description of the issues:
>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>. Basically, some
>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>> being UTF-8, so it's become a convention to do that. That's not good
>> enough, so you need to guess the encoding as well to make sure, but if there
>> is a BOM and you can otherwise verify that the file is probably UTF-8
>> encoded, you should discard it.
>
> That doesn't make sense. If the file isn't UTF-8 you can't see the
> BOM, because the BOM itself is UTF-8-encoded.
>
> (And yes, I know this happens. Doesn't mean we need to auto-guess by
> default; there are lots of issues e.g. what should happen after
> seeking to offset 0?)

The BOM should not be seekable if the file is opened with the proposed
"guess encoding from BOM" mode: it isn't properly part of the stream at
all in that case.

A UTF-8 BOM is an absurdity, but it exists *everywhere* in the wild:
Python would do well to make it as easy as possible to consume such
files, as well as the non-insane versions (UTF-16 / UTF-32 BOMs).

In the best of all possible worlds, I would just try opening the file so:

    f = open('/path/to/file', 'r', encoding="DWIFM")

and any BOM present would set the encoding for the remainder of the
stream.


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tsea...@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktGzLsACgkQ+gerLs4ltQ5+cwCdGfycPdj6+cPfD23vH644SpHL
sI0AoLGD7nfgMEJdJhBr90yjQQHfDgcJ
=js+2
-----END PGP SIGNATURE-----
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
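
For illustration only, here is a rough sketch of what a "guess encoding from BOM" mode might look like if written today on top of the io module. Nothing here is an existing open() mode: open_sniffing_bom, sniff_bom, _AfterBOM and the default parameter are hypothetical names invented for this sketch. It assumes a regular, seekable file opened for reading only, and it makes the BOM unreachable by seeking, so that offset 0 is the first character after the BOM.

```python
import codecs
import io

# UTF-32 is checked before UTF-16 because BOM_UTF32_LE starts with the
# same two bytes as BOM_UTF16_LE. (A UTF-16-LE file whose first character
# is U+0000 would therefore be misread as UTF-32-LE; this sketch accepts
# that ambiguity.)
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(head):
    """Return (encoding, bom_length) for the leading bytes, or (None, 0)."""
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return None, 0


class _AfterBOM(io.RawIOBase):
    """Read-only raw view of a binary file that hides its first `skip` bytes."""

    def __init__(self, raw, skip):
        super().__init__()
        self._raw = raw
        self._skip = skip
        raw.seek(skip)  # position 0 of this view is the byte after the BOM

    def readable(self):
        return True

    def seekable(self):
        return True

    def readinto(self, b):
        return self._raw.readinto(b)

    def seek(self, offset, whence=io.SEEK_SET):
        # Absolute offsets are shifted past the BOM; relative ones pass through.
        if whence == io.SEEK_SET:
            offset += self._skip
        return self._raw.seek(offset, whence) - self._skip

    def tell(self):
        return self._raw.tell() - self._skip

    def close(self):
        if not self.closed:
            self._raw.close()
        super().close()


def open_sniffing_bom(path, default="utf-8"):
    """Open `path` for reading text, taking the encoding from a BOM if present.

    `default` (an assumption of this sketch) is used when no BOM is found.
    """
    raw = open(path, "rb", buffering=0)
    encoding, skip = sniff_bom(raw.read(4) or b"")
    view = _AfterBOM(raw, skip)
    return io.TextIOWrapper(io.BufferedReader(view), encoding=encoding or default)
```

With this, f = open_sniffing_bom('/path/to/file') followed by f.seek(0) returns to the first real character rather than to the BOM. For the UTF-8 case alone, the existing "utf-8-sig" codec already discards a leading BOM when decoding.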