On Fri, Jan 8, 2010 at 1:05 AM, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>>> It *is* crazy, but unfortunately rather common. Wikipedia has a good
>>> description of the issues:
>>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>. Basically, some
>>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>>> being UTF-8, so it's become a convention to do that. That's not good
>>> enough, so you need to guess the encoding as well to make sure, but if there
>>> is a BOM and you can otherwise verify that the file is probably UTF-8
>>> encoded, you should discard it.
>>
>> That doesn't make sense. If the file isn't UTF-8 you can't see the
>> BOM, because the BOM itself is UTF-8-encoded.
>
> I think what Glyph meant is this: if a file starts with the UTF-8
> signature, assume it's UTF-8. Then validate the assumption against the
> rest of the file also, and then process it as UTF-8. If the rest clearly
> is not UTF-8, assume that the UTF-8 signature is bogus.
>
> I understood this proposal as a general processing guideline, not
> something the io library should do (but, say, a text editor).
>
> FWIW, I'm personally in favor of using the UTF-8 signature. If people
> consider them crazy talk, that may be because UTF-8 can't possibly have
> a byte order - hence I call it a signature, not the BOM. As a signature,
> I don't consider it crazy at all. There is a long tradition of having
> magic bytes in files (executable files, PostScript, PDF, ... - see
> /etc/magic). Having a magic byte sequence for plain text to denote the
> encoding is useful and helps reduce mojibake. This is the reason it's
> used on Windows: notepad would normally assume that text is in the ANSI
> code page, and for compatibility, it can't stop doing that. So the UTF-8
> signature gives them an exit strategy.
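
The processing guideline Martin describes above can be sketched in Python
roughly as follows. This is only an illustration under stated assumptions:
the function name decode_with_signature and the cp1252 fallback (standing in
for "the ANSI code page") are hypothetical choices, not anything proposed in
the thread. The only library features it relies on are codecs.BOM_UTF8 and
the standard "utf-8-sig" codec, which strips a leading signature.

```python
import codecs


def decode_with_signature(data, fallback="cp1252"):
    """Decode bytes, trusting a UTF-8 signature only if the rest validates.

    Returns a (text, encoding_used) pair. `fallback` is an assumed legacy
    encoding used when there is no signature or the signature is bogus.
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Strict decoding validates the whole payload as UTF-8;
            # "utf-8-sig" also drops the three signature bytes.
            return data.decode("utf-8-sig"), "utf-8-sig"
        except UnicodeDecodeError:
            # The rest is clearly not UTF-8, so treat the signature as bogus
            # and fall through to the legacy encoding.
            pass
    # No (trustworthy) signature: decode with the assumed legacy code page,
    # replacing any bytes that code page does not define.
    return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    signed = codecs.BOM_UTF8 + "héllo".encode("utf-8")
    print(decode_with_signature(signed))
    print(decode_with_signature(codecs.BOM_UTF8 + b"\xff\xfe"))
```

Note that this is an explicit helper a caller opts into (say, in a text
editor); it does not alter what open() or the io library do by default.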

Sure. I said "crazy talk" only to stir up discussion. Which worked. :-)

Also, I don't want Python's default behavior to change -- sniffing the
encoding should be a separate option.

--
--Guido van Rossum (python.org/~guido)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com