Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

Walter Dörwald Sat, 09 Jan 2010 03:21:31 -0800

Victor Stinner wrote:

On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:

The builtin open() function is unable to open a UTF-16/32 file starting with
a BOM if the encoding is not specified (it raises a Unicode error). For a
UTF-8 file starting with a BOM, read()/readline() also returns the BOM,
whereas the BOM should be "ignored".

It depends. If you use the utf-8-sig encoding, it *will* ignore the
UTF-8 signature.
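
A minimal sketch of the difference, using only the standard codecs; the sample bytes and the variable names are made up for illustration:

```python
import codecs

# Hypothetical sample data: the UTF-8 signature followed by some text.
data_with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain utf-8 codec keeps the signature as a leading U+FEFF character.
as_utf8 = data_with_bom.decode("utf-8")

# The utf-8-sig codec drops the signature while decoding.
as_utf8_sig = data_with_bom.decode("utf-8-sig")

print(repr(as_utf8))
print(repr(as_utf8_sig))
```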

Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8 and UTF-8+BOM files, you have to detect the encoding (not an easy job) or to remove the BOM after the first read (much harder if you use a module like ConfigParser or csv).

Since my proposal changes the result of TextIOWrapper.read()/readline()
for files starting with a BOM, we might introduce an option to open() to
enable the new behaviour. But is it really necessary to keep backward
compatibility?

Absolutely. And there is no need to produce a new option, but instead
use the existing options: define an encoding that auto-detects the
encoding from the family of BOMs. Maybe you call it encoding="sniff".


Good idea, I chose open(filename, encoding="BOM").

On the surface this looks like there's an encoding named "BOM", but looking at your patch I found that the check is still done in TextIOWrapper. IMHO the best approach would be to implement a *real* codec named "BOM" (or "sniff"). This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop.
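
As a rough sketch of what such a standalone codec could look like (this is not the patch under discussion): the names detect_encoding, sniff_decode, sniff_encode and _search are hypothetical, and falling back to UTF-8 when no BOM is found is an assumption. Only the stateless decode path is covered, so this is not yet usable with open(), which needs an incremental decoder as well.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(data, default="utf-8"):
    """Return a codec name chosen from the BOM, or the assumed default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def sniff_decode(data, errors="strict"):
    """Stateless decode: pick the real codec from the BOM, then delegate."""
    data = bytes(data)
    return data.decode(detect_encoding(data), errors), len(data)


def sniff_encode(text, errors="strict"):
    """Encoding side (an assumption): plain UTF-8 without a BOM."""
    return text.encode("utf-8", errors), len(text)


def _search(name):
    """Codec search function that only answers for the name "sniff"."""
    if name == "sniff":
        return codecs.CodecInfo(
            name="sniff", encode=sniff_encode, decode=sniff_decode
        )
    return None


codecs.register(_search)

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(codecs.decode(sample, "sniff"))
```

The selected codecs (utf-8-sig, utf-16, utf-32) consume the BOM themselves, so the sketch never strips bytes by hand.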

To see how something like this can be done, take a look at the UTF-16 codec, that switches to big-endian or little-endian mode depending on the first read/decode call.
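
The behaviour of the UTF-16 codec can be observed through its incremental decoder. A small illustration follows; the helper name decode_in_chunks and the sample chunks are made up for this example:

```python
import codecs


def decode_in_chunks(chunks):
    """Feed byte chunks to one UTF-16 incremental decoder and join the text."""
    decoder = codecs.getincrementaldecoder("utf-16")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and reports truncated input, if any.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# The first chunk carries only the BOM; it fixes the byte order for later calls.
big = decode_in_chunks([codecs.BOM_UTF16_BE, "AB".encode("utf-16-be")])
little = decode_in_chunks([codecs.BOM_UTF16_LE, "AB".encode("utf-16-le")])

print(big, little)
```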


Servus,
   Walter




