Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

M.-A. Lemburg Sat, 23 Jan 2010 05:26:45 -0800

"Martin v. Löwis" wrote:
>> This all begs the question: why is there a default? and why is the
>> default a guess?
>>
>> I have to admit that I was completely oblivious to this potential
>> pitfall, and mostly that's because in the most common case, I am working
>> with ASCII files.
> 
> You answered your own question: it is this reason why there is a
> default for the IO encoding.
> 
>> It's just serendipity that most systems specify (if
>> not require) the locale encoding be an ASCII superset.
> 
> No, it's not. It is deliberate that the locale's encoding is
> an ASCII superset. On systems where it isn't, users are typically
> well aware that they are not using ASCII. On the systems where
> it is, users get completely oblivious of the entire issue.
> 
>> I already know that this suggestion will not get any following because,
>> for most people, it just works. However: "In the face of ambiguity,
>> refuse the temptation to guess." Would it really be that unfortunate to
>> force everyone to reconsider what they are doing when they open() files?
> 
> Yes, definitely. It is this very reasoning that caused Python 2.x to
> use ASCII as the default encoding (when mixing strings and unicode),
> and, for the entire lifetime of 2.x, has caused endless pain for
> developers, who simply fail to understand the notion of encodings
> in the first place. The majority of developers are unable to get it
> right, in particular if their native language is English. These
> developers just hate Unicode. They google for solutions, and come
> up with all kinds of proposals which are all wrong (such as reloading
> the sys module to get back sys.setdefaultencoding, to then set it
> to UTF-8).
> 
> So for the limited case of text IO, Python 3.x now makes a guess.
> However, this guess is not in the face of ambiguity: it is the
> locale that the user (or his administrator) has selected, which
> identifies the language that they speak and the character encoding
> they use for text. So if Python also uses that encoding, it's not
> really an ambiguous guess.


No, but it's most likely a wrong guess, since text files don't
really have anything to do with what the user wants to see in
a user interface.

It may be a good guess for stdin/stdout/stderr, since these are part
of the user interface (if they are connected to a TTY), but not
for arbitrary text files, which are normally meant for data exchange.
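
To see which values are being guessed on a given machine, a minimal
Python 3 sketch along these lines may help. It assumes the locale-based
default discussed in this thread; the variable names are only
illustrative.

```python
import locale
import sys

# Encoding derived from the user's locale; under the locale-based default
# this is what open() falls back to when no encoding is passed.
file_default = locale.getpreferredencoding(False)

# Encoding attached to standard output; may be None if the stream
# has been replaced by an object without an encoding attribute.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("locale preferred encoding:", file_default)
print("stdout encoding:", stdout_encoding)
```

Both values describe the user's environment, not the file being opened,
which is the distinction drawn above.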

With the current guessing, the encoding used for text files depends
on the locale setting of the user running the application. If you
create a file on a Linux machine, you'll write a UTF-8 file; if you
then open it on Windows, you'll process garbage, since the Windows
Python installation will assume CP1252:

>>> print u'äöü'.encode('utf-8').decode('cp1252')
Ã¤Ã¶Ã¼

If you're lucky, you'll notice; if not, you'll work with corrupted
data.
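
The same failure can be sketched at the file level in Python 3 syntax.
This is only an illustration: the temporary file stands in for a file
moved between machines, and the reader's locale default is simulated by
passing encoding="cp1252" explicitly.

```python
import os
import tempfile

text = "äöü"

# Create a scratch file that stands in for the exchanged data file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Writer side: the file is stored as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reader side, wrong assumption: decode the same bytes as CP1252.
    with open(path, encoding="cp1252") as f:
        garbled = f.read()

    # Reader side, matching encoding named explicitly.
    with open(path, encoding="utf-8") as f:
        roundtrip = f.read()
finally:
    os.remove(path)

# garbled is now 'Ã¤Ã¶Ã¼' and no exception was raised;
# roundtrip is the original 'äöü'.
```

Note that the mismatched read succeeds silently, because every byte in
the file happens to be a valid CP1252 character.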

I think that's what the Zen sentence is all about: "In the face of
ambiguity, refuse the temptation to guess." It saves you from
situations that are difficult to detect and recover from.

Especially when processing data, it's usually better to fail and
provide an opportunity to fix the data, rather than proceeding based
on some guessed assumption.
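
A minimal sketch of the "fail instead of guessing" approach in Python 3
syntax: the caller must name the encoding, and invalid bytes raise an
error rather than being passed through. The helper name decode_or_fail
is hypothetical and used only for illustration.

```python
def decode_or_fail(data: bytes, encoding: str) -> str:
    """Decode with an explicitly named encoding; raise on invalid bytes."""
    return data.decode(encoding, errors="strict")


# Bytes produced on a CP1252 system ...
cp1252_bytes = "äöü".encode("cp1252")

# ... and declared as UTF-8 by the consumer: strict decoding refuses them.
try:
    decode_or_fail(cp1252_bytes, "utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# The lenient alternative hides the problem behind U+FFFD placeholders.
replaced = cp1252_bytes.decode("utf-8", errors="replace")
```

Strict decoding only catches mismatches that produce invalid byte
sequences. UTF-8 data read as CP1252, as in the example further up,
usually decodes without error, so naming the encoding explicitly is
still the part that matters.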

-- 
Marc-Andre Lemburg
eGenix.com
