Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython

2010-01-24 Thread Floris Bruynooghe
On Sat, Jan 23, 2010 at 10:09:14PM +0100, Cesare Di Mauro wrote:
 Introducing C++ is a big step, also. Aside from the problems it can bring on some
 platforms, it means that C++ can now be used by CPython developers. It
 doesn't make sense to force people to use C for everything but the JIT part. In
 the end, CPython could become a mix of C and C++ code, so a bit more
 difficult to understand and manage.

Introducing C++ is a big step, but I disagree that it means C++ should
be allowed in the other CPython code.  C++ can be problematic on more
obscure platforms (certainly when static initialisers are used) and
being able to build a python without C++ (no JIT/LLVM) would be a huge
benefit, effectively having the option to build an old-style CPython
at compile time.  (This is why I asked about --without-llvm being able
not to link with libstdc++).

Regards
Floris

-- 
Debian GNU/Linux -- The Power of Freedom
www.debian.org | www.gnu.org | www.kernel.org
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Michael Foord


On 23 Jan 2010, at 07:53, Martin v. Löwis mar...@v.loewis.de wrote:


[snip...]


Yes, definitely. It is this very reasoning that caused Python 2.x to
use ASCII as the default encoding (when mixing strings and unicode),
and, for the entire lifetime of 2.x, has caused endless pain for
developers who simply fail to understand the notion of encodings
in the first place. The majority of developers are unable to get it
right, in particular if their native language is English. These
developers just hate Unicode. They google for solutions, and come
up with all kinds of proposals which are all wrong (such as reloading
the sys module to get back sys.setdefaultencoding, to then set it
to UTF-8).

So for the limited case of text IO, Python 3.x now makes a guess.
However, this guess is not in the face of ambiguity: it is the
locale that the user (or his administrator) has selected, which
identifies the language that they speak and the character encoding
they use for text. So if Python also uses that encoding, it's not
really an ambiguous guess.
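
To make the locale-based default concrete, here is a minimal sketch of how one
might inspect the encoding that text I/O falls back to when none is given. It
assumes a Python 3 version where open() takes its default from the locale; the
variable names are illustrative only.

```python
import codecs
import locale

# The encoding text I/O falls back to when none is given, as selected by
# the user's (or administrator's) locale settings.
default_encoding = locale.getpreferredencoding(False)

# Look the name up to see which codec Python associates with it.
codec_name = codecs.lookup(default_encoding).name

print(default_encoding, codec_name)
```

The value printed depends entirely on the machine's configuration, which is
the point under discussion.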




However it is likely to be often wrong, and where the user's locale  
specifies an encoding like CP1252 then it will result in silent  
corruption rather than an immediate exception.


This is why I'm keen that by *default* Python should honour the UTF8  
signature when reading files; particularly given that programmers who  
don't/can't/won't understand encodings are likely to read files  
without specifying an encoding and a lot of the time it will *seem* to  
work.
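
A small sketch of the silent-corruption case described above, assuming a file
that was saved as UTF-8 with a signature and a locale that selects CP1252. The
sample text is made up for illustration.

```python
import codecs

# Bytes of a small file saved as UTF-8 with a signature (BOM).
raw = codecs.BOM_UTF8 + "café".encode("utf-8")

# Decoding with CP1252 raises no exception for these bytes;
# it quietly yields mojibake, including a garbled signature.
as_cp1252 = raw.decode("cp1252")

# The utf-8-sig codec strips the signature and decodes correctly.
as_utf8 = raw.decode("utf-8-sig")

print(repr(as_cp1252))
print(repr(as_utf8))
```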


Michael





Regards,
Martin



Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Stephen J. Turnbull
Michael Foord writes:

  This is why I'm keen that by *default* Python should honour the UTF8  
  signature when reading files;

Unfortunately, your caveat about "a lot of the time it will *seem* to
work" applies to this as well.  The only way that honoring
signatures really works is if Python simply uses the UTF-8 codec on
input and output by default, regardless of locale.  Or perhaps if by
default Python should error out unless a signature is found.

Autodetection (ie, doing something different depending on the presence
or absence of the signature) does not really work, because for it to
work correctly, it needs to imply automatic resetting of the output
codec as well.  So what is your naive programmer supposed to expect
when writing a cat program?  Should the first encoding detected or
defaulted determine the output codec?  The last one?  UTF-8 uber
alles?

Such autodetection *can* be done fairly accurately.  After 20 years of
experimenting, Emacs has it pretty much right.  But ... Emacs almost
never runs without a human watching it.  And the code that handles
this is a mess of special cases and heuristics.  Not to mention
throwing more than a few exceptions in practice.  And in practice any
decisions that need to be made about disambiguating the output codec
are left up to the user.

  particularly given that programmers who don't/can't/won't
  understand encodings are likely to read files without specifying an
  encoding and a lot of the time it will *seem* to work.

But that's a different problem.  If you want to fix that you should
require an explicit codec parameter on all text I/O.  They'll still
just memorize the magic incantation and grumble about the extra
characters they have to type, but they'll have been warned.
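
As a rough illustration of what "require an explicit codec parameter" could
look like, here is a sketch of a wrapper whose encoding argument is mandatory.
The name open_text is hypothetical and not an existing or proposed API.

```python
def open_text(path, mode="r", *, encoding):
    """Hypothetical wrapper: text I/O with a mandatory encoding keyword."""
    if "b" in mode:
        raise ValueError("open_text is for text mode only")
    return open(path, mode, encoding=encoding)


# Omitting the keyword fails before any file is touched.
missing_encoding_error = None
try:
    open_text("example.txt")
except TypeError as exc:
    missing_encoding_error = exc

print(type(missing_encoding_error).__name__)
```

The caller still only has to memorize one incantation, but the choice of codec
is at least visible at every call site.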



Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Michael Foord

On 24/01/2010 14:23, Stephen J. Turnbull wrote:

Michael Foord writes:

This is why I'm keen that by *default* Python should honour the UTF8
signature when reading files;

Unfortunately, your caveat about "a lot of the time it will *seem* to
work" applies to this as well.  The only way that honoring
signatures really works is if Python simply uses the UTF-8 codec on
input and output by default, regardless of locale.  Or perhaps if by
default Python should error out unless a signature is found.

   
When reading text files the presence of the UTF-8 signature *almost 
invariably* means a UTF-8 encoding. Honouring this will almost always be 
better than using the wrong encoding. Of course there are caveats, but 
it will be a substantial improvement.
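
A minimal sketch of the behaviour being proposed, written as ordinary user
code rather than as a change to open(). The helper names are hypothetical, and
the fallback to the locale's preferred encoding is an assumption about what
"otherwise do what Python does today" means.

```python
import codecs
import locale
import os
import tempfile


def guess_text_encoding(path):
    """Return 'utf-8-sig' if the file starts with the UTF-8 signature,
    otherwise the locale's preferred encoding."""
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))
    if head == codecs.BOM_UTF8:
        return "utf-8-sig"
    return locale.getpreferredencoding(False)


def open_honouring_signature(path):
    """Open a file for reading text, honouring a UTF-8 signature if present."""
    return open(path, "r", encoding=guess_text_encoding(path))


# Demonstration with a temporary file that carries a signature.
with tempfile.TemporaryDirectory() as tmp:
    signed_path = os.path.join(tmp, "signed.txt")
    with open(signed_path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "café".encode("utf-8"))
    detected = guess_text_encoding(signed_path)
    with open_honouring_signature(signed_path) as f:
        signed_text = f.read()

print(detected, repr(signed_text))
```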




Autodetection (ie, doing something different depending on the presence
or absence of the signature) does not really work, because for it to
work correctly, it needs to imply automatic resetting of the output
codec as well.  So what is your naive programmer supposed to expect
when writing a cat program?  Should the first encoding detected or
defaulted determine the output codec?  The last one?  UTF-8 uber
alles?
   
Unless you keep the information about the original encoding along with
the decoded string, changing the (default) output encoding depending on
the input is simply not possible - and so not really relevant.



Michael

Such autodetection *can* be done fairly accurately.  After 20 years of
experimenting, Emacs has it pretty much right.  But ... Emacs almost
never runs without a human watching it.  And the code that handles
this is a mess of special cases and heuristics.  Not to mention
throwing more than a few exceptions in practice.  And in practice any
decisions that need to be made about disambiguating the output codec
are left up to the user.

particularly given that programmers who don't/can't/won't
understand encodings are likely to read files without specifying an
encoding and a lot of the time it will *seem* to work.

But that's a different problem.  If you want to fix that you should
require an explicit codec parameter on all text I/O.  They'll still
just memorize the magic incantation and grumble about the extra
characters they have to type, but they'll have been warned.
   



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog

READ CAREFULLY. By accepting and reading this email you agree, on behalf of 
your employer, to release me from all obligations and waivers arising from any 
and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, 
clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and 
acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your 
employer, its partners, licensors, agents and assigns, in perpetuity, without 
prejudice to my ongoing rights and privileges. You further represent that you 
have the authority to release me from any BOGUS AGREEMENTS on behalf of your 
employer.




Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Stephen J. Turnbull
Michael Foord writes:

  When reading text files the presence of the UTF-8 signature *almost 
  invariably* means a UTF-8 encoding. Honouring this will almost always be 
  better than using the wrong encoding. Of course there are caveats, but 
  it will be a substantial improvement.

Sure, that would be better than using the wrong encoding *if* the only
thing that matters is getting the input codec right.  But it's not
clear that it's an improvement from the naive programmers' point of
view, which needs to take into account the behavior of the whole
application.  Is it an improvement if it seems to work in testing,
and then munges something important to the boss because she has a
correspondent who uses UTF-8, not UTF-8-signature?  Maybe it's better
if it screws up almost all the time, so that the problem is detected
early!

  Unless you keep the information about the original encoding along with
  the decoded string, changing the (default) output encoding depending on
  the input is simply not possible - and so not really relevant.

That's throwing the baby out with the bathwater.  Very few practical
applications that care about the input encoding are going to be
willing to accept an output encoding that doesn't correspond to the
input encoding in an appropriate way.

*If* you are going to advocate guessing about the input encoding, even
based on very strong signals like the UTF-8 signature, then you really
have to advocate adding the infrastructure to ensure that the output
encoding is properly set.  If the output encoding is the programmer's
problem, then it's purely pandering to laziness not to ask them to
deal with the input encoding as well.
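
One possible shape for such infrastructure, sketched under the assumption that
the decoded text simply travels together with the name of the codec it came
from. The class and function names are hypothetical, and the utf-8 fallback is
an arbitrary choice for the example.

```python
import codecs
import os
import tempfile
from typing import NamedTuple


class DecodedText(NamedTuple):
    text: str
    encoding: str  # codec the text was decoded with


def read_remembering_encoding(path, fallback="utf-8"):
    """Decode a file and remember which codec was used."""
    with open(path, "rb") as f:
        data = f.read()
    encoding = "utf-8-sig" if data.startswith(codecs.BOM_UTF8) else fallback
    return DecodedText(data.decode(encoding), encoding)


def write_with_same_encoding(path, decoded):
    """Encode the text with the codec it was originally read with."""
    with open(path, "wb") as f:
        f.write(decoded.text.encode(decoded.encoding))


# A simple text replacement that preserves the signature on output.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "in.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "naïve\n".encode("utf-8"))
    doc = read_remembering_encoding(path)
    edited = doc._replace(text=doc.text.replace("naïve", "naive"))
    write_with_same_encoding(path, edited)
    with open(path, "rb") as f:
        result = f.read()

print(result)
```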


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Antoine Pitrou
Stephen J. Turnbull stephen at xemacs.org writes:
 
 That's throwing the baby out with the bathwater.  Very few practical
 applications that care about the input encoding are going to be
 willing to accept an output encoding that doesn't correspond to the
 input encoding in an appropriate way.

Perhaps you are speaking with your emacs hat, where the purpose is to output to
the same file that serves as input. But most applications do not work in that
manner. They take some input and optionally produce an output in an entirely
different format (another file format, or some database requests, or some
visual feedback, etc.). Therefore both encodings are decorrelated.

If I'm reading a configuration file the encoding of the configuration file will
not decide which charset my dynamic HTML pages are using.

Regards

Antoine.




Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Stephen J. Turnbull
Antoine Pitrou writes:

  Perhaps you are speaking with your emacs hat, where the purpose is
  to output to the same file that serves as input.

No, I'm not wearing my Emacs hat.  If I was, there would be no
problem.  You just use binary for most such purposes.  Historically
that was how even Emacs worked under X: you did input and output to
files in an 8-bit clean way, then picked your screen font to
correspond to your preferred encoding.  Of course that's assuming an
8-bit encoding, but historically Emacs couldn't do anything useful
with multibyte coding systems.

  But most applications do not work in that manner. They take some
  input and optionally produce an output in an entirely different
  format (another file format, or some database requests, or some
  visual feedback, etc.). Therefore both encodings are
  decorrelated.

I concede that I have no better statistics on the matter than you do,
but I think that's wishful thinking.  It is quite common for pure
output to be mixed with echoed input, for example.  Even if a file
is converted to another format (eg, restructured text to LaTeX), it's
very common for the text encoding to be preserved.  Visual feedback
related to text files typically includes fragments of the text.  And
so on.

Of course it is possible to give examples where they can be
decorrelated.  But examples that support Michael's position are harder
to come by than you seem to think.  For example:

  If I'm reading a configuration file the encoding of the
  configuration file will not decide which charset my dynamic HTML
  pages are using.

But it *does* determine the charset of ErrorDocuments displayed by
Apache.  Users are likely to get somewhat confused if the
ErrorDocuments are in a different charset from your dynamic HTML.

You just can't get away from the need for explicit management of
codecs if you want a robust internationalized application.  I don't
object to giving users an easy way to get the behavior Michael
proposes; it just should not be the *default*.


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Antoine Pitrou
Stephen J. Turnbull stephen at xemacs.org writes:
 
 But it *does* determine the charset of ErrorDocuments displayed by
 Apache.  Users are likely to get somewhat confused if the
 ErrorDocuments are in a different charset from your dynamic HTML.

Why would they? The browser picks the encoding from either the HTTP headers or
the HTML meta tag; these don't have to be the same for every document served by
the same domain.
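
To illustrate that the charset is declared per response, here is a small
sketch that pulls the charset out of a Content-Type header value using the
standard library. The header values are made-up examples, not taken from any
real site.

```python
from email.message import Message


def charset_from_content_type(value, default=None):
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


# Two documents from the same (hypothetical) server may declare
# different charsets.
page_charset = charset_from_content_type("text/html; charset=UTF-8")
error_charset = charset_from_content_type("text/html; charset=Shift_JIS")
undeclared = charset_from_content_type("text/html")

print(page_charset, error_charset, undeclared)
```

When no charset is declared at all, the client has to fall back on something
else, which is where guessing comes back in.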

 You just can't get away from the need for explicit management of
 codecs if you want a robust internationalized application.

I would answer it depends :-) But, as you said, I have to admit that it's
difficult to find any authoritative answer to the issue.

Regards

Antoine.




Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython

2010-01-24 Thread Cesare Di Mauro
2010/1/24 Floris Bruynooghe floris.bruynoo...@gmail.com

 Introducing C++ is a big step, but I disagree that it means C++ should
 be allowed in the other CPython code.  C++ can be problematic on more
 obscure platforms (certainly when static initialisers are used) and
 being able to build a python without C++ (no JIT/LLVM) would be a huge
 benefit, effectively having the option to build an old-style CPython
 at compile time.  (This is why I asked about --without-llvm being able
 not to link with libstdc++).

 Regards
 Floris


That's why I suggested the use of an external module, but if I have
understood correctly ceval.c needs to be changed using C++ for some parts.

If no C++ is required to compile the classic, non-jitted CPython, my thought
was wrong, of course.

Cesare


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Martin v. Löwis
 However it is likely to be often wrong, and where the user's locale
 specifies an encoding like CP1252 then it will result in silent
 corruption rather than an immediate exception.

Why do you say that? Why do you think it will likely be often wrong?
Most likely, encoding text files with cp1252 will be exactly right,
and what the end user wanted.

 This is why I'm keen that by *default* Python should honour the UTF8
 signature when reading files; particularly given that programmers who
 don't/can't/won't understand encodings are likely to read files without
 specifying an encoding and a lot of the time it will *seem* to work.

That's probably a reasonable idea - but may also make things worse:
on writing, you'd still use cp1252, so you may end up outputting the
file in a different encoding. That would be particularly unfortunate
if you were merely performing some simple text replacement.

So whatever the API - there's always tradeoffs.

Regards,
Martin


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Martin v. Löwis
 So what is your naive programmer supposed to expect
 when writing a cat program?

This may be a bit out of context - however, a simple cat program should
open files in binary, and be done.
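
A minimal sketch of such a binary cat, assuming plain concatenation is all
that is wanted. The function signature and the temporary files are
illustrative.

```python
import io
import os
import shutil
import sys
import tempfile


def cat(paths, out=None):
    """Concatenate files byte for byte; no decoding is involved."""
    out = sys.stdout.buffer if out is None else out
    for path in paths:
        with open(path, "rb") as f:
            shutil.copyfileobj(f, out)


# Two inputs in different encodings pass through unchanged.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    with open(first, "wb") as f:
        f.write("café\n".encode("utf-8"))
    with open(second, "wb") as f:
        f.write("café\n".encode("cp1252"))
    buffer = io.BytesIO()
    cat([first, second], out=buffer)
    combined = buffer.getvalue()

print(combined)
```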

(not sure whether the average naive programmer is able to grasp the
notion of binary IO as opposed to text IO, and whether he then would
be able to conclude that cat(1) is really about binary IO).

Regards,
Martin


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Michael Foord

On 24/01/2010 18:41, Martin v. Löwis wrote:

However it is likely to be often wrong, and where the user's locale
specifies an encoding like CP1252 then it will result in silent
corruption rather than an immediate exception.
 

Why do you say that? Why do you think it will likely be often wrong?
Most likely, encoding text files with cp1252 will be exactly right,
and what the end user wanted.

   


If the file has a UTF-8 signature then decoding the file with CP1252 
will almost always be wrong. I'm *not* suggesting switching to UTF8 by 
default, which we can't do as 3.1 stable is now out with the current 
behavior.



This is why I'm keen that by *default* Python should honour the UTF8
signature when reading files; particularly given that programmers who
don't/can't/won't understand encodings are likely to read files without
specifying an encoding and a lot of the time it will *seem* to work.
 

That's probably a reasonable idea - but may also make things worse:
on writing, you'd still use cp1252, so you may end up outputting the
file in a different encoding. That would be particularly unfortunate
if you were merely performing some simple text replacement.
   


Decoding a UTF-8 file with CP1252 will always succeed, but if it 
contains non-ascii characters then 'simple text replacement' will either 
not work or can corrupt the data. Reading as UTF-8 and then outputting 
as CP1252 (without data loss) is preferable in my opinion. If 'guessing' 
an encoding using the user's locale is acceptable then using another 
*very strong* indicator (i.e. the presence of the UTF8 signature) should 
also be acceptable.
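
A short sketch of the "simple text replacement" case, assuming UTF-8 input and
a CP1252 locale. The sample strings are made up for illustration.

```python
raw = "résumé: café".encode("utf-8")

# Decoded with the wrong codec, the accented characters become mojibake,
# so a search for the real word finds nothing.
wrong = raw.decode("cp1252")
wrong_result = wrong.replace("café", "tea")

# Decoded as UTF-8, the replacement works, and the result can still be
# written out as CP1252 because every character exists in that codec.
right = raw.decode("utf-8")
right_result = right.replace("café", "tea")
as_locale_bytes = right_result.encode("cp1252")

print(repr(wrong_result))
print(repr(right_result), as_locale_bytes)
```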


In addition there are many programs where the reading of data is 
separate from the writing of data (configuration files, xml etc) - so 
that the encoding of any files written is logically distinct. In my 
experience only a minority of programs have destructively rewritten 
their input files. If the programmer is never specifying an encoding but 
has an input file with a UTF8 signature, writing output in the locale 
specified encoding is the *right* thing to do. It may be different from 
the input encoding but it will be successfully read back in next time 
around.



So whatever the API - there's always tradeoffs.
   


Sure. I think the presence of a UTF-8 signature strongly enough 
indicates the encoding of the file to make it a better choice than using 
the locale preference. Only of course where an explicit encoding was not 
specified.




Regards,
Martin
   



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog





Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Oleg Broytman
On Sun, Jan 24, 2010 at 07:45:20PM +0100, Martin v. Löwis wrote:
 This may be a bit out of context - however, a simple cat program should
 open files in binary, and be done.
 
 (not sure whether the average naive programmer is able to grasp the
 notion of binary IO as opposed to text IO, and whether he then would
 be able to conclude that cat(1) is really about binary IO).

   Depends on the kind of cat and especially on the ways of using it. If
you ask cat to number lines (see manual for GNU cat) - what do lines mean
for binary IO?

Oleg.
-- 
 Oleg Broytman    http://phd.pp.ru/    p...@phd.pp.ru
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Martin v. Löwis
 I concede that I have no better statistics on the matter than you do,
 but I think that's wishful thinking.  It is quite common for pure
 output to be mixed with echoed input, for example.  Even if a file
 is converted to another format (eg, restructured text to LaTeX), it's
 very common for the text encoding to be preserved.  Visual feedback
 related to text files typically includes fragments of the text.  And
 so on.

Please try to categorize Python applications. My bet is that the
majority of Python applications written today do web stuff. In
the web, input encoding and output encoding are fairly decorrelated -
in particular for databases and files read from disk.

 You just can't get away from the need for explicit management of
 codecs if you want a robust internationalized application.  I don't
 object to giving users an easy way to get the behavior Michael
 proposes; it just should not be the *default*.

An easy way is pointless if it's not the default. The complicated
way is to pass a parameter indicating what encoding you want to
use. It's complicated not because it's difficult to use, but because
you first need to grasp this entire unicode stuff. So if the easy
way wasn't the default, you are lost with the error message you
get, and the only word you recognize in it is unicode, which is,
as far as you know, a synonym for hell.

Regards,
Martin


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Antoine Pitrou
Oleg Broytman phd at phd.pp.ru writes:
 
Depends on the kind of cat and especially on the ways of using it. If
 you ask cat to number lines (see manual for GNU cat) - what do lines mean
 for binary IO?

b'\n'-separated chunks of data. See the docs:
http://docs.python.org/3.1/library/io.html#io.IOBase.readline
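
For illustration, a minimal sketch of numbering lines without ever decoding
them, relying on the fact that iterating a binary stream yields chunks ending
in b'\n'. The function name and the sample bytes are made up.

```python
import io


def number_lines(stream):
    """Yield each b'\\n'-terminated chunk prefixed with its line number."""
    for number, line in enumerate(stream, start=1):
        yield b"%6d\t" % number + line


# The second line is not valid text in any particular encoding,
# yet it can still be numbered.
data = io.BytesIO(b"caf\xc3\xa9\n\xff\xfe raw bytes\nlast")
numbered = b"".join(number_lines(data))

print(numbered)
```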


Antoine.




Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Alexander Belopolsky
On Sun, Jan 24, 2010 at 1:54 PM, Oleg Broytman p...@phd.pp.ru wrote:
..
   Depends on the kind of cat and especially on the ways of using it. If
 you ask cat to number lines (see manual for GNU cat) - what do lines mean
 for binary IO?

Maybe this is yet another reason why some kinds of cat are a bad idea:


cat isn't for printing files with line numbers, it isn't for
compressing multiple blank lines, it's not for looking at non-printing
ASCII characters, it's for concatenating files.


- Rob Pike, "UNIX Style, or cat -v Considered Harmful", USENIX Summer
Conference Proceedings, 1983.


[Python-Dev] Python 2.5.5 Release Candidate 2

2010-01-24 Thread Martin v. Löwis
Subject: [ANN] Python 2.5.5 Release Candidate 2.

On behalf of the Python development team and the Python community, I'm
happy to announce release candidate 2 of Python 2.5.5.

This is a source-only release that only includes security fixes. The
last full bug-fix release of Python 2.5 was Python 2.5.4. Users are
encouraged to upgrade to the latest release of Python 2.6 (which is
2.6.4 at this point).

This release fixes issues with the logging and tarfile modules, and
with thread-local variables. Since release candidate 1, additional
bugs have been fixed in the expat module. See the detailed release
notes at the website (also available as Misc/NEWS in the source
distribution) for details of bugs fixed.

For more information on Python 2.5.5, including download links for
various platforms, release notes, and known issues, please see:

http://www.python.org/2.5.5

Highlights of the previous major Python releases are available from
the Python 2.5 page, at

http://www.python.org/2.5/highlights.html

Enjoy this release,
Martin

Martin v. Loewis
mar...@v.loewis.de
Python Release Manager
(on behalf of the entire python-dev team)


Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Tres Seaver

Stephen J. Turnbull wrote:

 You just can't get away from the need for explicit management of
 codecs if you want a robust internationalized application.  I don't
 object to giving users an easy way to get the behavior Michael
 proposes; it just should not be the *default*.

Using any guessing based on the locale (which describes the codec used
by the user's console, but is completely uncorrelated to any particular
file on the user's filesystem) is just about guaranteed to fail for lots
of users.

Any guessing at all should have to be enabled by the application:  the
library doesn't have enough information to make a non-data-mangling
guess in some of those cases.  Opening a file is one of those places
where people need to think about the bytes vs. text problem:  we can't
make that go away by playing whack-a-mole with the edge cases.


Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Design   http://palladion.com



Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Stephen J. Turnbull
Antoine Pitrou writes:
  Stephen J. Turnbull stephen at xemacs.org writes:
   
   But it *does* determine the charset of ErrorDocuments displayed by
   Apache.  Users are likely to get somewhat confused if the
   ErrorDocuments are in a different charset from your dynamic HTML.
  
  Why would they? The browser picks the encoding from either the HTTP
  headers or the HTML meta tag; these don't have to be the same for
  every document served by the same domain.

Don't ask me why; I just know that my experience is that mojibake
happens on some Japanese sites with the default configuration of
Firefox 3.5 or 3.6.  Perhaps it's a bug in Firefox, but I think it's
more likely that folks are setting default charsets incompatibly with
ErrorDocuments.  Either way, it happens.

The point that you're avoiding is that in fact ErrorDocument literals
*do* pick up their charsets from the config file, and therefore that
charset cannot be decorrelated with the output charset in some
circumstances.



Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Stephen J. Turnbull
Martin v. Löwis writes:

  My bet is that the majority of Python applications written today do
  web stuff. In the web, input encoding and output encoding are
  fairly decorrelated - in particular for databases and files read
  from disk.

Sure.  Which means that programmers have to do a lot of explicit codec
management anyway.  If you hide output codec management in libraries
and provide convenient defaults for input codecs, the end result is
intermittent mojibake that's hard to fix.  Especially if the output
gets saved to disk and the input thrown away, as is sometimes the case.

   You just can't get away from the need for explicit management of
   codecs if you want a robust internationalized application.  I don't
   object to giving users an easy way to get the behavior Michael
   proposes; it just should not be the *default*.
  
  An easy way is pointless if it's not the default.

Sure, but that default should be set by the site, or in some cases by
the application as Tres Seaver suggests, not by the Python source
distribution.

  get, and the only word you recognize in it is unicode, which is,
  as far as you know, a synonym for hell.

Welcome to Hell^H^H^H^Hthe Hotel Internet.  You can check out, but
you can never leave.

In a multilingual environment, you have three choices: code everything
in one universal coded character set, or manage codecs explicitly and
associate a character set to each body of content, or guess and accept
more or less frequent mojibake (and put off the day where you choose
one of the sane alternatives until it costs five times as much).  That
last choice should not be the default, however much the users demand
it.  The first choice is a much better (more Pythonic) default:

- UTF-8 is the one obvious way to do it.  It's portable to all
  interesting platforms and the default on many of them.  It is
  sufficient for almost all purposes (admittedly it may be costly to
  convert legacy content from its original coded character set, but in
  that case the explicit management option is usually viable), and
  it is well-supported by Python.

- Refusing to guess is easy to document, and easy to debug.  I see no
  great benefit to guessing to override the Zen.  Note that Michael is
  correct: in the presence of the UTF-8 signature, for practical
  purposes you're not guessing.  But that's only half the story: if
  behavior is *different* when there is *no* signature, then in those
  cases there is ambiguity and you *are* guessing.
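
A minimal sketch of the "UTF-8 only, refuse to guess" default, assuming strict
error handling. The helper name is hypothetical; the point is only that input
in another encoding tends to fail loudly instead of being silently misread.

```python
def decode_utf8_or_fail(data):
    """Decode bytes as UTF-8; error handling is strict by default."""
    return data.decode("utf-8")


# Genuine UTF-8 decodes as expected.
ok = decode_utf8_or_fail("naïve".encode("utf-8"))

# The same word saved as CP1252 is rejected with an exception.
failure = None
try:
    decode_utf8_or_fail("naïve".encode("cp1252"))
except UnicodeDecodeError as exc:
    failure = exc

print(repr(ok), type(failure).__name__)
```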



Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

2010-01-24 Thread Martin v. Löwis
 Using any guessing based on the locale (which describes the codec used
 by the user's console, but is completely uncorrelated to any particular
 file on the user's filesystem)

No, it's not just the encoding of the console. It is also the encoding
that text editors will use, in absence of a more specific direction.

 Any guessing at all should have to be enabled by the application:  the
 library doesn't have enough information to make a non-data-mangling
 guess in some of those cases.  Opening a file is one of those places
 where people need to think about the bytes vs. text problem:  we can't
 make that go away by playing whack-a-mole with the edge cases.

Many developers are completely unable to make that choice, as Python 2
has demonstrated.

Regards,
Martin