On 9/5/06, David Hopwood <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
> > On 9/5/06, Paul Prescod <[EMAIL PROTECTED]> wrote:
> >
> >> Beyond all of that: It just seems wrong to me that I could send someone a
> >> bunch of files and a Python program and their results processing them
> >> would be different from mine, despite the fact that we run the same
> >> version of Python on the same operating system.
> >
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
> I don't know about vi, but notepad will open and save files that are not in
> the system ("ANSI") encoding just fine. On opening it checks for a BOM and
> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
> "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> Encoding drop-down box.
>
> This is exactly the behaviour that most users would expect of a well-behaved
> Unicode-aware app. It should be as easy as possible to match this behaviour
> in a Python program.
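
For concreteness, here is a minimal sketch of the BOM check described in the quoted text, written for present-day Python. The names sniff_encoding and read_text are hypothetical helpers, not an existing or proposed API. The fallback to locale.getpreferredencoding(False) when no BOM is found is only one plausible default, chosen here as an assumption. UTF-32 BOMs are deliberately ignored, since the quoted description only mentions UTF-8 and UTF-16.

```python
import codecs
import locale

# BOM prefixes and the codec that decodes (and strips) each one.
# Assumption: only UTF-8 and UTF-16 are sniffed; UTF-32 is ignored here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw, default=None):
    """Return a codec name for the bytes in raw.

    If raw starts with a known BOM, return the matching codec.
    Otherwise return default, or the platform's preferred encoding
    when no default is given.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default or locale.getpreferredencoding(False)


def read_text(path, default=None):
    """Read a whole file as text, honouring a leading BOM if present."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(sniff_encoding(raw, default))
```

A caller who knows better than the platform can pass an explicit fallback, e.g. read_text("data.txt", default="latin-1"); files that carry a BOM are decoded according to the BOM regardless of that fallback.
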

And this is exactly why I want the determination of the default
encoding (i.e. the encoding to be used when opening a file when no
explicit encoding is specified by the Python code that does the
opening) to be open-ended, rather than picking some standard default
like UTF-8 and saying (like Paul seems to want to say) "this is it".

> > Sorry Paul, I appreciate your standards-driven perspective, but in
> > this area I'd rather build in more flexibility than strictly needed,
> > than too little. If it turns out that on a particular platform all
> > files are in UTF-8, making Python *on that platform* always choose
> > UTF-8 is simple enough.
>
> The problem is not the systems where all files are UTF-8, or all files are
> another known charset. The problem is the platforms where half of the files
> are UTF-8 and half are in some other charset, determined either by type or by
> presence of a UTF-8 BOM. This is a *very* common situation, especially for
> European users.

Right. (And Paul appears to be ignorant of this.)

> Such a user cannot set the locale to UTF-8, because that will break all of
> their non-Unicode-aware applications. The Unicode-aware applications typically
> have much better support for reading and writing files in charsets that are
> not the system default. So in practice the locale has to be set to the "old"
> charset during a migration to UTF-8.
>
> (Setting different locales for different applications is far too much hassle.
> On Windows, although I believe it is technically possible to do the equivalent
> of selecting a UTF-8 locale, most users don't know how to do it, even if they
> want to use UTF-8 exclusively.)

Right. Of course, "locale" and "encoding" are somewhat orthogonal
issues; the encoding may be UTF-8 but that doesn't determine other
aspects of the locale (such as language-specific collation order, or
culture-specific formatting of numbers, dates and money).
Now, some platforms may equate the two somehow, and on those platforms
we would have to inspect the locale to tell the encoding; but other
platforms may specify the encoding separate from the locale...

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000