On Wed, 2010-01-20 at 22:37 +0100, M.-A. Lemburg wrote:
> David Malcolm wrote:
> > I'm thinking of making this downstream change to Fedora's site.py (and
> > possibly in future RHEL releases) so that the default encoding
> > automatically picks up the encoding from the locale:
> > 
> >  def setencoding():
> >      """Set the string encoding used by the Unicode implementation.  The
> >      default is 'ascii', but if you're willing to experiment, you can
> >      change this."""
> >      encoding = "ascii" # Default value set by _PyUnicode_Init()
> > -    if 0:
> > +    if 1:
> >          # Enable to support locale aware default string encodings.
> >          import locale
> >          loc = locale.getdefaultlocale()
> >          if loc[1]:
> >              encoding = loc[1]
> >      if 0:
> >          # Enable to switch off string to Unicode coercion and implicit
> >          # Unicode to string conversion.
> >          encoding = "undefined"
> >      if encoding != "ascii":
> >          # On Non-Unicode builds this will raise an AttributeError...
> >          sys.setdefaultencoding(encoding) # Needs Python Unicode build !
> > 
> > I've written up extensive notes on the change and the history of the
> > issue here:
> > https://fedoraproject.org/wiki/Features/PythonEncodingUsesSystemLocale
> > 
> > Please let me know if there are any errors on that page!
> > 
> > The aim is to avoid strange behavior changes when running a script
> > within a shell pipeline/cronjob as opposed to at a tty (and to capture
> > some of the bizarre corner cases; for example, I found the behavior of
> > the pango/pygtk modules particularly surprising).
> > 
> > I mention it here as a "heads-up" about the change:
> >   - in case other distributions may want to do the same (or already do
> > so, though in my very brief survey no-one else seemed to), and
> >   - in case doing so breaks things in a way I'm not expecting; can
> > anyone see any flaws in my arguments?
> >   - in case other people find my notes on the issue useful
> > 
> > Hope this is helpful; can anyone see any potential problems with this
> > change?
> 
> Yes: such a change is unsupported by Python. The code you are
> changing should really have been removed many releases ago -
> it was originally only intended to serve as basis for experimentation
> on choosing the "right" default encoding.
> 
> The only supported default encodings in Python are:
> 
>  Python 2.x: ASCII
>  Python 3.x: UTF-8
> 
> If you change these, you are on your own and strange things will
> start to happen. The default encoding does not only affect
> the translation between Python and the outside world, but also
> all internal conversions between 8-bit strings and Unicode.
>
> Hacks like what's happening in the pango module (setting the
> default encoding to 'utf-8' by reloading the site module in
> order to get the sys.setdefaultencoding() API back) are just
> downright wrong and will cause serious problems since Unicode
> objects cache their default encoded representation.

Thanks for the feedback.

Note that pango isn't even doing the module reload hack; it's written in
C, and goes in directly through the C API:
   PyUnicode_SetDefaultEncoding("utf-8");

I should mention that I've seen at least one C module in the wild that
exists merely to do this:

  #include <Python.h>
  void initutf8_please(void) {
     PyUnicode_SetDefaultEncoding("utf-8");
  }

so that the user could do "import utf8_please" at the top of their
scripts.

> If all you want to achieve is getting the encodings of
> stdout and stdin correctly setup for pipes, you should
> instead change the .encoding attribute of those (only).
Currently they are set up, but only when connected to a tty, which leads
to surprising behavior changes inside pipes and cronjobs. For example,
piping a unicode string to "less" fails immediately for code points above
127: less expects locale-encoded bytes, but sys.stdout has no encoding
set, so the default encoding (ASCII) is used.

Similarly:
[da...@brick ~]$ python -c "import sys; print sys.stdout.encoding"
UTF-8
[da...@brick ~]$ python -c "import sys; print sys.stdout.encoding" | cat
None
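
For comparison, the script-level workaround that avoids touching the
default encoding looks roughly like the sketch below. The helper name
locale_writer is made up for illustration, and the UTF-8 fallback is my
assumption; it wraps a byte stream in a codecs writer so that unicode
output is encoded the same way whether or not a tty is attached.

```python
import codecs
import io
import locale


def locale_writer(byte_stream, encoding=None):
    """Wrap a byte stream so unicode text is encoded on write.

    Hypothetical helper: if no encoding is passed, fall back to the
    locale's preferred encoding, then to UTF-8 (an assumption).
    """
    if encoding is None:
        encoding = locale.getpreferredencoding() or "utf-8"
    return codecs.getwriter(encoding)(byte_stream)


# Stand-in for a piped stdout; no tty is involved here.
buf = io.BytesIO()
out = locale_writer(buf, "utf-8")
out.write(u"caf\xe9\n")
```

In a real script the stream passed in would be sys.stdout on Python 2
(or sys.stdout.buffer on Python 3) rather than the BytesIO stand-in.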

Why only set an encoding on these streams when they're directly
connected to a tty?  I'll patch things to remove the isatty conditional
if that's acceptable.

(The tty check appeared with the initial commit that added
locale-encoding support to sys.std[in|out], in sysmodule.c:
http://svn.python.org/view?view=rev&revision=32719
It was later moved from sysmodule.c to pythonrun.c:
http://svn.python.org/view?view=rev&revision=33817
and was then extended to cover stderr:
http://svn.python.org/view?view=rev&revision=43581
again, only if directly connected to a tty.)
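
To make the proposal concrete, here is a simplified Python model of the
decision. This is not the interpreter's actual C code; the function
name and the require_tty parameter are invented for illustration.

```python
def std_stream_encoding(is_tty, locale_codeset, require_tty=True):
    """Return the encoding a standard stream would be given, or None.

    Simplified model, not the interpreter's code: with require_tty=True
    the locale codeset is applied only to a tty; with require_tty=False
    (the proposed change) it is applied unconditionally.
    """
    if require_tty and not is_tty:
        return None
    return locale_codeset or None


# Current behaviour: a piped stream gets no encoding.
print(std_stream_encoding(False, "UTF-8"))
# Proposed behaviour: the locale codeset is used regardless.
print(std_stream_encoding(False, "UTF-8", require_tty=False))
```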

Dave
