Probably one part of this is OT, but I think it could complement the discussion ;o)

On Mon, Jan 11, 2010 at 3:44 PM, M.-A. Lemburg <m...@egenix.com> wrote:
> Olemis Lang wrote:
>>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
>>> <victor.stin...@haypocalc.com> wrote:
>>>> Hi,
>>>>
>>>> Builtin open() function is unable to open an UTF-16/32 file starting with a
>>>> BOM if the encoding is not specified (raise an unicode error). For an UTF-8
>>>> file starting with a BOM, read()/readline() returns also the BOM whereas the
>>>> BOM should be "ignored".
>>>>
>> [...]
>>>
>>
>> I had similar issues too (please read below ;o) ...
>>
>> On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <gu...@python.org> wrote:
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>>> talk. And for the other two, perhaps it would make more sense to have
>>> a separate encoding-guessing function that takes a binary stream and
>>> returns a text stream wrapping it with the proper encoding?
>>>
>>
>> About guessing the encoding, I experienced this issue while I was
>> developing a Trac plugin. What I was doing is as follows :
>>
>> - I guessed the MIME type + charset encoding using Trac MIME API (it
>>   was a CSV file encoded using UTF-16)
>> - I read the file using `open`
>> - Then wrapped the file using `codecs.EncodedFile`
>> - Then used `csv.reader`
>>
>> ... and still get the BOM in the first value of the first row in the CSV
>> file.
>
> You didn't say, but I presume that the charset guessing logic
> returned either 'utf-16-le' or 'utf-16-be'

Yes. In fact they return the full mimetype
'text/csv; charset=utf-16-le' ... ;o)

> - those encodings don't
> remove the leading BOM.

... and they should?

> The 'utf-16' codec will remove the BOM.
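
For the record, here is a minimal sketch of the difference being described. The sample text and the variable names are made up for illustration; this is not the code of the Trac plugin, just the two codecs applied to the same bytes:

```python
import codecs

# Hypothetical sample: the bytes of a UTF-16 little-endian file that
# starts with a BOM, followed by a CSV-like header row.
text = u"id,name"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# 'utf-16' uses the BOM to pick the byte order and then drops it.
with_generic = data.decode("utf-16")

# 'utf-16-le' already knows the byte order, so the BOM is decoded
# as an ordinary U+FEFF character and stays in the result.
with_explicit = data.decode("utf-16-le")

print(repr(with_generic))
print(repr(with_explicit))
```

So with 'utf-16-le' the first field of the first row comes back prefixed with U+FEFF, which matches what I was seeing in the CSV values.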

In this particular case there's nothing I can do, I have to process
whatever charset is detected in the input ;o)

>> {{{
>> #!python
>>
>> >>> mimetype
>> 'utf-16-le'
>> >>> ef = EncodedFile(f, 'utf-8', mimetype)
>> }}}
>
> Same here: the UTF-8 codec will not remove the BOM, you have
> to use the 'utf-8-sig' codec for that.
>
>> IMO I think I am +1 for leaving `open` just like it is, and use module
>> `codecs` to deal with encodings, but I am strongly -1 for returning
>> the BOM while using `EncodedFile` (mainly because encoding is
>> explicitly supplied in ;o)
>
> Note that EncodedFile() doesn't do any fancy BOM detection or
> filtering.

... directly.

> This is the job of the codecs.
>

OK ... to come back to the scope of the subject: in the general case, I
think that the BOM (if any) should be handled by the codecs, and therefore
indirectly by EncodedFile. If that's an explicit way of working with
encodings, I'd prefer to use that wrapper explicitly in order to
(encode | decode) the file, and let the codec detect whether there's a
BOM or not and «adjust» `tell`, `read` and everything else in that
wrapper (instead of in `open`).

> Also note that BOM removal is only valid at the beginning of
> a file. All subsequent BOM-bytes have to be read as-is (they
> map to a zero-width non-breaking space) - without removing them.
>

;o)

--
Regards,

Olemis.

Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/

Featured article:
Test cases for custom query (i.e report 9) ... PASS (1.0.0) -
http://simelo.hg.sourceforge.net/hgweb/simelo/trac-gviz/rev/d276011e7297
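
PS: To illustrate the two remarks about UTF-8 quoted above, here is a small sketch, assuming made-up sample bytes (a BOM at the start and the same three bytes repeated in the middle). It only shows what the two stdlib codecs do with them; it says nothing about how `open` or `EncodedFile` should behave:

```python
import codecs

# Hypothetical sample: a leading BOM, then the same byte sequence
# again further into the data.
data = codecs.BOM_UTF8 + b"a" + codecs.BOM_UTF8 + b"b"

# The plain 'utf-8' codec keeps every U+FEFF, including the first one.
plain = data.decode("utf-8")

# 'utf-8-sig' drops the BOM only at the very beginning; the later
# one is read as-is (a zero-width no-break space).
sig = data.decode("utf-8-sig")

print(repr(plain))
print(repr(sig))
```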