Olemis Lang wrote:
>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
>> <victor.stin...@haypocalc.com> wrote:
>>> Hi,
>>>
>>> Builtin open() function is unable to open an UTF-16/32 file starting with a
>>> BOM if the encoding is not specified (raise an unicode error). For an UTF-8
>>> file starting with a BOM, read()/readline() returns also the BOM whereas the
>>> BOM should be "ignored".
>>>
> [...]
>>
>
> I had similar issues too (please read below ;o) ...
>
> On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <gu...@python.org> wrote:
>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>> talk. And for the other two, perhaps it would make more sense to have
>> a separate encoding-guessing function that takes a binary stream and
>> returns a text stream wrapping it with the proper encoding?
>>
>
> About guessing the encoding, I experienced this issue while I was
> developing a Trac plugin. What I was doing is as follows :
>
> - I guessed the MIME type + charset encoding using Trac MIME API (it
>   was a CSV file encoded using UTF-16)
> - I read the file using `open`
> - Then wrapped the file using `codecs.EncodedFile`
> - Then used `csv.reader`
>
> ... and still get the BOM in the first value of the first row in the CSV file.

You didn't say, but I presume that the charset guessing logic returned
either 'utf-16-le' or 'utf-16-be' - those encodings don't remove the
leading BOM. The 'utf-16' codec will remove the BOM.

> {{{
> #!python
>
> >>> mimetype
> 'utf-16-le'
> >>> ef = EncodedFile(f, 'utf-8', mimetype)
> }}}

Same here: the UTF-8 codec will not remove the BOM, you have to use
the 'utf-8-sig' codec for that.

> IMO I think I am +1 for leaving `open` just like it is, and use module
> `codecs` to deal with encodings, but I am strongly -1 for returning
> the BOM while using `EncodedFile` (mainly because encoding is
> explicitly supplied in ;o)

Note that EncodedFile() doesn't do any fancy BOM detection or
filtering. This is the job of the codecs.

Also note that BOM removal is only valid at the beginning of
a file. All subsequent BOM-bytes have to be read as-is (they
map to a zero-width non-breaking space) - without removing them.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 11 2010)
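
A minimal sketch of the codec behaviour described in this reply, assuming Python 3 and its standard codec names. The sample payload and all variable names are made up for illustration; this is not the Trac plugin code from the thread.

```python
import codecs
import csv
import io

text = "a,b\n"  # hypothetical one-row CSV payload

# UTF-16 data with a little-endian BOM in front.
utf16_data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# 'utf-16' uses the BOM to pick the byte order and drops it.
utf16_plain = utf16_data.decode("utf-16")
# 'utf-16-le' has a fixed byte order, so the BOM is decoded as U+FEFF.
utf16_le = utf16_data.decode("utf-16-le")

# UTF-8 data with a BOM (signature) in front.
utf8_data = codecs.BOM_UTF8 + text.encode("utf-8")
utf8_plain = utf8_data.decode("utf-8")    # BOM kept as U+FEFF
utf8_sig = utf8_data.decode("utf-8-sig")  # leading BOM dropped

# Only the leading BOM is removed; a later one is kept as U+FEFF.
inner = (codecs.BOM_UTF8 + b"a" + codecs.BOM_UTF8 + b"b").decode("utf-8-sig")

# The symptom from the thread: the BOM ends up in the first CSV cell
# when the endian-specific codec is used.
first_row_le = next(csv.reader(io.StringIO(utf16_le)))
first_row = next(csv.reader(io.StringIO(utf16_plain)))
```

With the endian-specific codec the first cell carries the U+FEFF character, which matches the report above; with 'utf-16' or 'utf-8-sig' it does not.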