"Martin v. Löwis" writes:

 > My bet is that the majority of Python applications written today do
 > "web" stuff. In the web, input encoding and output encoding are
 > fairly decorrelated - in particular for databases and files read
 > from disk.

Sure.  Which means that programmers have to do a lot of explicit codec
management anyway.  If you hide output codec management in libraries
and provide "convenient" defaults for input codecs, the end result is
intermittent mojibake that's hard to fix.  Especially if the output
gets saved to disk and the input thrown away, as is sometimes the case.
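
To make the failure mode concrete, here is a minimal sketch.  It assumes
content stored as UTF-8 and a "convenient" input default of Latin-1; the
variable names are purely illustrative.

```python
# Illustrative only: UTF-8 bytes read back with a wrong "convenient" default.
original = "Löwis"
stored = original.encode("utf-8")      # what actually sits on disk

# Latin-1 maps every byte to a character, so the wrong guess raises no error.
guessed = stored.decode("latin-1")

# The output codec, hidden in some library, now persists the damage.
resaved = guessed.encode("utf-8")

print(guessed)                         # mojibake instead of the original name
print(resaved != stored)               # the saved bytes no longer match

# Recovery is possible only if you know which wrong codec was applied.
recovered = resaved.decode("utf-8").encode("latin-1").decode("utf-8")
```

Nothing fails at the point of the mistake, which is why it is hard to
trace later, and once `stored` is thrown away the repair depends on
knowing exactly which wrong guess was made.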

 > > You just can't get away from the need for explicit management of
 > > codecs if you want a robust internationalized application.  I don't
 > > object to giving users an easy way to get the behavior Michael
 > > proposes; it just should not be the *default*.
 > 
 > An easy way is pointless if it's not the default.

Sure, but that default should be set by the site, or in some cases by
the application as Tres Seaver suggests, not by the Python source
distribution.

 > get, and the only word you recognize in it is "unicode", which is,
 > as far as you know, a synonym for "hell".

Welcome to Hell^H^H^H^Hthe Hotel Internet.  "You can check out, but
you can never leave."

In a multilingual environment, you have three choices: code everything
in one universal coded character set, or manage codecs explicitly and
associate a character set with each body of content, or guess and accept
more or less frequent mojibake (and put off the day when you choose
one of the sane alternatives until it costs five times as much).  That
last choice should not be the default, however much the users demand
it.  The first choice is a much better (more Pythonic) default:

- UTF-8 is the one obvious way to do it.  It's portable to all
  interesting platforms and the default on many of them.  It is
  sufficient for almost all purposes (admittedly it may be costly to
  convert legacy content from its original coded character set, but in
  that case the "explicit management" option is usually viable), and
  it is well-supported by Python.

- Refusing to guess is easy to document, and easy to debug.  I see no
  benefit to guessing that is great enough to override the Zen.  Note
  that Michael is correct: in the presence of the UTF-8 signature, for
  practical purposes you're not guessing.  But that's only half the
  story: if behavior is *different* when there is *no* signature, then
  in those cases there is ambiguity and you *are* guessing.
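
A sketch of what "refusing to guess" could look like in practice.  The
helper name `decode_strict` and its `declared` parameter are hypothetical,
not an existing or proposed Python API; the point is only that behavior
is the same with and without the signature unless the caller says
otherwise.

```python
import codecs
from typing import Optional


def decode_strict(data: bytes, declared: Optional[str] = None) -> str:
    """Decode bytes without guessing (hypothetical helper, for illustration)."""
    # A UTF-8 signature is unambiguous for practical purposes: strip and decode.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # Explicit codec management: the caller states the encoding.
    if declared is not None:
        return data.decode(declared)
    # No signature and no declaration: still UTF-8, decoded strictly.
    # Invalid input raises UnicodeDecodeError rather than yielding mojibake.
    return data.decode("utf-8")


with_signature = decode_strict(codecs.BOM_UTF8 + "Löwis".encode("utf-8"))
without_signature = decode_strict("Löwis".encode("utf-8"))
legacy = decode_strict("Löwis".encode("latin-1"), declared="latin-1")
```

Legacy Latin-1 bytes passed without `declared` fail loudly at the point
of input, which is the easy-to-debug behavior argued for above.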

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev