Guido van Rossum wrote:
> On 2/15/06, Fuzzyman <[EMAIL PROTECTED]> wrote:
>
>> Forcing the programmer to be aware of encodings, also pushes the same
>> requirement onto the user (who is often the source of the text in question).
>
> The programmer shouldn't have to be aware of encodings most of the
> time -- it's the job of the I/O library to determine the end user's
> (as opposed to the language's) default encoding dynamically and act
> accordingly. Users who use non-ASCII characters without informing the
> OS of their encoding are in a world of pain, *unless* they use the OS
> default encoding (which may vary per locale). If the OS can figure out
> the default encoding, so can the Python I/O library. Many apps won't
> have to go beyond this at all.
>
> Note that I don't want to use this OS/user default encoding as the
> default encoding between bytes and strings; once you are reading bytes
> you are writing "grown-up" code and you will have to be explicit. It's
> only the I/O library that should automatically encode on write and
> decode on read.
>
>> Currently you can read a text file and process it - making sure that any
>> changes/requirements only use ascii characters. It therefore doesn't matter
>> what 8 bit ascii-superset encoding is used in the original. If you force the
>> programmer to specify the encoding in order to read the file, they would
>> have to pass that requirement onto their user. Their user is even less
>> likely to be encoding aware than the programmer.
>
> I disagree -- the user most likely has set or received a default
> encoding when they first got the computer, and that's all they are
> using. If other tools (notepad, wordpad, emacs, vi etc.) can figure
> out the encoding, so can Python's I/O library.

I'm intrigued by the encoding guessing techniques you envisage. I currently use a modified version of something contained within docutils.
I read the file in binary and first check for a UTF8 or UTF16 BOM. Then I try to decode the text using the following encodings (in this order):

    ascii
    UTF8
    locale.nl_langinfo(locale.CODESET)
    locale.getlocale()[1]
    locale.getdefaultlocale()[1]
    ISO8859-1
    cp1252

(The encodings returned by the locale calls are only used on platforms where they exist.) The first decode that doesn't blow up, I assume is correct.

The problem I have is that I usually (for the application I have in mind, anyway) then want to re-encode into a consistent encoding rather than back into the original encoding. If the encoding of the original (usually unspecified) is any arbitrary 8-bit ascii superset (as it usually is), then it will probably not blow up when decoded with any other arbitrary 8-bit encoding. This means I sometimes get junk. I'm curious whether there is anything extra I could do.

This is possibly beyond the scope of this discussion (in which case I apologise), but we are discussing the techniques the I/O layer would use to 'guess' the encoding of a file opened in text mode - so maybe it's not so off topic.

There is also the following cookbook recipe that uses a heuristic to guess encoding:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/163743

XML, HTML, or other text streams may also contain additional information about their encoding - which may be unreliable. :-)

All the best,

Michael Foord

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
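[Editorial note: a minimal Python sketch of the BOM-check-plus-trial-decode scheme described in the message above. The candidate list mirrors the one given there; this is an illustration, not the actual docutils-derived code being discussed.]

```python
import codecs
import locale


def guess_and_decode(data):
    """Guess the encoding of a byte string: check for a Unicode BOM,
    then trial-decode with a fixed list of candidate encodings.
    Returns (text, encoding_used)."""
    # A BOM, when present, identifies the encoding unambiguously.
    # 'utf-8-sig' and 'utf-16' both strip the BOM during decoding.
    boms = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            return data.decode(encoding), encoding

    # Candidate encodings, tried in order.  The locale-derived entries
    # may be None or missing on some platforms, so they are filtered out.
    candidates = ['ascii', 'utf-8']
    try:
        candidates.append(locale.nl_langinfo(locale.CODESET))
    except AttributeError:
        pass  # nl_langinfo is not available on all platforms
    candidates.append(locale.getlocale()[1])
    candidates.append(locale.getdefaultlocale()[1])
    candidates += ['iso8859-1', 'cp1252']

    # The first decode that doesn't blow up is assumed to be correct --
    # which, as noted above, can silently produce junk for arbitrary
    # 8-bit supersets of ASCII.
    for encoding in candidates:
        if not encoding:
            continue
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    raise UnicodeError('no candidate encoding could decode the data')
```

As the message notes, the weakness is the last loop: ISO8859-1 and cp1252 decode nearly any byte sequence without error, so a wrong guess there succeeds silently.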