>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>> description of the issues:
>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>.  Basically, some
>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>> being UTF-8, so it's become a convention to do that.  That's not good
>> enough, so you need to guess the encoding as well to make sure, but if there
>> is a BOM and you can otherwise verify that the file is probably UTF-8
>> encoded, you should discard it.
>
> That doesn't make sense. If the file isn't UTF-8 you can't see the
> BOM, because the BOM itself is UTF-8-encoded.

I think what Glyph meant is this: if a file starts with the UTF-8
signature, assume it's UTF-8. Then validate that assumption against the
rest of the file, and process it as UTF-8. If the rest clearly is not
UTF-8, assume that the UTF-8 signature is bogus.

I understood this proposal as a general processing guideline, not as
something the io library should do (but something that, say, a text
editor should).

FWIW, I'm personally in favor of using the UTF-8 signature. If people
consider it crazy talk, that may be because UTF-8 can't possibly have a
byte order - hence I call it a signature, not the BOM. As a signature, I
don't consider it crazy at all. There is a long tradition of having magic
bytes in files (executable files, PostScript, PDF, ... - see /etc/magic).
Having a magic byte sequence for plain text to denote the encoding is
useful and helps reduce mojibake. This is the reason it's used on
Windows: Notepad would normally assume that text is in the ANSI code
page, and for compatibility, it can't stop doing that. So the UTF-8
signature gives them an exit strategy.

Regards,
Martin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
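
For illustration, the guideline described above (trust the signature, then
validate the rest, and treat the signature as bogus if validation fails)
could be sketched in Python roughly as follows. The function name
`decode_with_signature` and the `cp1252` fallback (standing in for "the ANSI
code page") are assumptions made for this example, not anything proposed in
the thread, and this is not a suggestion for what the io library should do.

```python
import codecs


def decode_with_signature(data, fallback="cp1252"):
    """Decode bytes, trusting a UTF-8 signature only if the rest validates.

    Hypothetical helper: returns a (text, encoding_used) tuple.
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Assume UTF-8, drop the signature, and validate the assumption
            # against the rest of the data by decoding it strictly.
            rest = data[len(codecs.BOM_UTF8):]
            return rest.decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            # The rest clearly is not UTF-8: treat the signature as bogus
            # and fall through, keeping those three bytes as ordinary data.
            pass
    # No (credible) signature: use the assumed legacy code page.
    return data.decode(fallback, errors="replace"), fallback
```

When the signature is known to be reliable, the standard library's
"utf-8-sig" codec already strips it on decoding; the sketch only adds the
"validate, else distrust the signature" step.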