Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
On Sat, Jan 23, 2010 at 10:09:14PM +0100, Cesare Di Mauro wrote: Introducing C++ is a big step, also. Aside from the problems it can bring on some platforms, it means that C++ can now be used by CPython developers. It doesn't make sense to force people to use C for everything but the JIT part. In the end, CPython could become a mix of C and C++ code, so a bit more difficult to understand and manage. Introducing C++ is a big step, but I disagree that it means C++ should be allowed in the other CPython code. C++ can be problematic on more obscure platforms (certainly when static initialisers are used) and being able to build a python without C++ (no JIT/LLVM) would be a huge benefit, effectively having the option to build an old-style CPython at compile time. (This is why I asked whether --without-llvm would avoid linking with libstdc++.) Regards Floris -- Debian GNU/Linux -- The Power of Freedom www.debian.org | www.gnu.org | www.kernel.org ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
On 23 Jan 2010, at 07:53, Martin v. Löwis mar...@v.loewis.de wrote: [snip...] Yes, definitely. It is this very reasoning that caused Python 2.x to use ASCII as the default encoding (when mixing strings and unicode), and, for the entire lifetime of 2.x, has caused endless pain for developers who simply fail to understand the notion of encodings in the first place. The majority of developers are unable to get it right, in particular if their native language is English. These developers just hate Unicode. They google for solutions, and come up with all kinds of proposals which are all wrong (such as reloading the sys module to get back sys.setdefaultencoding, to then set it to UTF-8). So for the limited case of text IO, Python 3.x now makes a guess. However, this guess is not in the face of ambiguity: it is the locale that the user (or his administrator) has selected, which identifies the language that they speak and the character encoding they use for text. So if Python also uses that encoding, it's not really an ambiguous guess. However it is likely to be often wrong, and where the user's locale specifies an encoding like CP1252 then it will result in silent corruption rather than an immediate exception. This is why I'm keen that by *default* Python should honour the UTF8 signature when reading files; particularly given that programmers who don't/can't/won't understand encodings are likely to read files without specifying an encoding and a lot of the time it will *seem* to work. Michael Regards, Martin
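The "silent corruption rather than an immediate exception" point can be made concrete with a short sketch. It assumes a Python 3 release of the kind discussed in this thread, where open() without an encoding argument falls back to locale.getpreferredencoding(False); the variable names are illustrative only and not from the thread.

```python
import locale

# The codec that text-mode open() falls back to when none is given
# (assumption: a Python 3 release where the locale decides the default).
default_encoding = locale.getpreferredencoding(False)

# U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes C3 A9.
utf8_bytes = "\u00e9".encode("utf-8")

# CP1252 maps both bytes to printable characters, so decoding raises
# nothing and quietly yields two wrong characters (U+00C3, U+00A9).
misread = utf8_bytes.decode("cp1252")

# An ASCII default, as in Python 2, fails loudly on the same bytes.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Here the ASCII codec reports the problem at once, while the CP1252 codec hands back mojibake that only a human reader would notice.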
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Michael Foord writes: This is why I'm keen that by *default* Python should honour the UTF8 signature when reading files; Unfortunately, your caveat about a lot of the time it will *seem* to work applies to this as well. The only way that honoring signatures really works is if Python simply uses the UTF-8 codec on input and output by default, regardless of locale. Or perhaps if by default Python should error out unless a signature is found. Autodetection (ie, doing something different depending on the presence or absence of the signature) does not really work, because for it to work correctly, it needs to imply automatic resetting of the output codec as well. So what is your naive programmer supposed to expect when writing a cat program? Should the first encoding detected or defaulted determine the output codec? The last one? UTF-8 uber alles? Such autodetection *can* be done fairly accurately. After 20 years of experimenting, Emacs has it pretty much right. But ... Emacs almost never runs without a human watching it. And the code that handles this is a mess of special cases and heuristics. Not to mention throwing more than a few exceptions in practice. And in practice any decisions that need to be made about disambiguating the output codec are left up to the user. particularly given that programmers who don't/can't/won't understand encodings are likely to read files without specifying an encoding and a lot of the time it will *seem* to work. But that's a different problem. If you want to fix that you should require an explicit codec parameter on all text I/O. They'll still just memorize the magic incantation and grumble about the extra characters they have to type, but they'll have been warned.
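As a rough illustration of "require an explicit codec parameter on all text I/O", here is a minimal sketch. The wrapper open_text() is hypothetical (it is not part of Python or of any proposal in this thread); it only shows how a keyword-only argument without a default forces the caller to name a codec.

```python
import os
import tempfile


def open_text(path, mode="r", *, encoding, **kwargs):
    """Hypothetical wrapper around open() that refuses to guess a codec."""
    if "b" in mode:
        raise ValueError("open_text() is for text I/O only")
    if encoding is None:
        raise TypeError("open_text() needs an explicit encoding")
    return open(path, mode, encoding=encoding, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")
    with open_text(path, "w", encoding="utf-8") as f:
        f.write("na\u00efve\n")
    with open_text(path, encoding="utf-8") as f:
        round_trip = f.read()
    try:
        open_text(path)  # no codec named: rejected before any file is opened
        refused = False
    except TypeError:
        refused = True
```

The caller still has to know which codec to pass, which is exactly the "magic incantation" objection raised above; the sketch only moves the failure to the call site.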
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
On 24/01/2010 14:23, Stephen J. Turnbull wrote: Michael Foord writes: This is why I'm keen that by *default* Python should honour the UTF8 signature when reading files; Unfortunately, your caveat about a lot of the time it will *seem* to work applies to this as well. The only way that honoring signatures really works is if Python simply uses the UTF-8 codec on input and output by default, regardless of locale. Or perhaps if by default Python should error out unless a signature is found. When reading text files the presence of the UTF-8 signature *almost invariably* means a UTF-8 encoding. Honouring this will almost always be better than using the wrong encoding. Of course there are caveats, but it will be a substantial improvement. Autodetection (ie, doing something different depending on the presence or absence of the signature) does not really work, because for it to work correctly, it needs to imply automatic resetting of the output codec as well. So what is your naive programmer supposed to expect when writing a cat program? Should the first encoding detected or defaulted determine the output codec? The last one? UTF-8 uber alles? Unless you keep the information about the original encoding along with the decoded string, changing the (default) output encoding depending on the input is simply not possible - and so not really relevant. Michael Such autodetection *can* be done fairly accurately. After 20 years of experimenting, Emacs has it pretty much right. But ... Emacs almost never runs without a human watching it. And the code that handles this is a mess of special cases and heuristics. Not to mention throwing more than a few exceptions in practice. And in practice any decisions that need to be made about disambiguating the output codec are left up to the user. particularly given that programmers who don't/can't/won't understand encodings are likely to read files without specifying an encoding and a lot of the time it will *seem* to work. 
But that's a different problem. If you want to fix that you should require an explicit codec parameter on all text I/O. They'll still just memorize the magic incantation and grumble about the extra characters they have to type, but they'll have been warned. -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
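The behaviour Michael argues for (honour the UTF-8 signature on read, otherwise keep the locale default) could look roughly like the sketch below. The helper name open_honouring_signature() is hypothetical, and the sketch covers reading only; it deliberately says nothing about the output codec, which is the part Stephen objects to.

```python
import codecs
import locale
import os
import tempfile


def open_honouring_signature(path, encoding=None):
    """Hypothetical helper: open `path` for reading as text.

    An explicit encoding always wins. Otherwise a leading UTF-8
    signature selects "utf-8-sig", and its absence falls back to the
    locale's preferred encoding.
    """
    if encoding is None:
        with open(path, "rb") as raw:
            signed = raw.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
        encoding = "utf-8-sig" if signed else locale.getpreferredencoding(False)
    return open(path, "r", encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "signed.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "na\u00efve\n".encode("utf-8"))
    with open_honouring_signature(path) as f:
        text = f.read()  # signature stripped, contents decoded as UTF-8
```

Files without a signature take the second branch, so their result still depends on the locale; that asymmetry is the ambiguity debated in the rest of the thread.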
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Michael Foord writes: When reading text files the presence of the UTF-8 signature *almost invariably* means a UTF-8 encoding. Honouring this will almost always be better than using the wrong encoding. Of course there are caveats, but it will be a substantial improvement. Sure, that would be better than using the wrong encoding *if* the only thing that matters is getting the input codec right. But it's not clear that it's an improvement from the naive programmers' point of view, which needs to take into account the behavior of the whole application. Is it an improvement if it seems to work in testing, and then munges something important to the boss because she has a correspondent who uses UTF-8, not UTF-8-signature? Maybe it's better if it screws up almost all the time, so that the problem is detected early! Unless you keep the information about the original encoding along with the decoded string, changing the (default) output encoding depending on the input is simply not possible - and so not really relevant. That's throwing the baby out with the bathwater. Very few practical applications that care about the input encoding are going to be willing to accept an output encoding that doesn't correspond to the input encoding in an appropriate way. *If* you are going to advocate guessing about the input encoding, even based on very strong signals like the UTF-8 signature, then you really have to advocate adding the infrastructure to ensure that the output encoding is properly set. If the output encoding is the programmer's problem, then it's purely pandering to laziness not to ask them to deal with the input encoding as well.
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Stephen J. Turnbull stephen at xemacs.org writes: That's throwing the baby out with the bathwater. Very few practical applications that care about the input encoding are going to be willing to accept an output encoding that doesn't correspond to the input encoding in an appropriate way. Perhaps you are speaking with your Emacs hat on, where the purpose is to output to the same file that serves as input. But most applications do not work in that manner. They take some input and optionally produce an output in an entirely different format (another file format, or some database requests, or some visual feedback, etc.). Therefore both encodings are decorrelated. If I'm reading a configuration file the encoding of the configuration file will not decide which charset my dynamic HTML pages are using. Regards Antoine.
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Antoine Pitrou writes: Perhaps you are speaking with your Emacs hat on, where the purpose is to output to the same file that serves as input. No, I'm not wearing my Emacs hat. If I was, there would be no problem. You just use binary for most such purposes. Historically that was how even Emacs worked under X: you did input and output to files in an 8-bit clean way, then picked your screen font to correspond to your preferred encoding. Of course that's assuming an 8-bit encoding, but historically Emacs couldn't do anything useful with multibyte coding systems. But most applications do not work in that manner. They take some input and optionally produce an output in an entirely different format (another file format, or some database requests, or some visual feedback, etc.). Therefore both encodings are decorrelated. I concede that I have no better statistics on the matter than you do, but I think that's wishful thinking. It is quite common for pure output to be mixed with echoed input, for example. Even if a file is converted to another format (eg, restructured text to LaTeX), it's very common for the text encoding to be preserved. Visual feedback related to text files typically includes fragments of the text. And so on. Of course it is possible to give examples where they can be decorrelated. But examples that support Michael's position are harder to come by than you seem to think. For example: If I'm reading a configuration file the encoding of the configuration file will not decide which charset my dynamic HTML pages are using. But it *does* determine the charset of ErrorDocuments displayed by Apache. Users are likely to get somewhat confused if the ErrorDocuments are in a different charset from your dynamic HTML. You just can't get away from the need for explicit management of codecs if you want a robust internationalized application. I don't object to giving users an easy way to get the behavior Michael proposes; it just should not be the *default*.
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Stephen J. Turnbull stephen at xemacs.org writes: But it *does* determine the charset of ErrorDocuments displayed by Apache. Users are likely to get somewhat confused if the ErrorDocuments are in a different charset from your dynamic HTML. Why would they? The browser picks the encoding from either the HTTP headers or the HTML meta tag; these don't have to be the same for every document served by the same domain. You just can't get away from the need for explicit management of codecs if you want a robust internationalized application. I would answer it depends :-) But, as you said, I have to admit that it's difficult to find any authoritative answer to the issue. Regards Antoine.
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
2010/1/24 Floris Bruynooghe floris.bruynoo...@gmail.com Introducing C++ is a big step, but I disagree that it means C++ should be allowed in the other CPython code. C++ can be problematic on more obscure platforms (certainly when static initialisers are used) and being able to build a python without C++ (no JIT/LLVM) would be a huge benefit, effectively having the option to build an old-style CPython at compile time. (This is why I asked whether --without-llvm would avoid linking with libstdc++.) Regards Floris That's why I suggested the use of an external module, but if I have understood correctly ceval.c needs to be changed to use C++ for some parts. If no C++ is required to compile the classic, non-JIT CPython, then my assumption was wrong, of course. Cesare
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
However it is likely to be often wrong, and where the user's locale specifies an encoding like CP1252 then it will result in silent corruption rather than an immediate exception. Why do you say that? Why do you think it will likely be often wrong? Most likely, encoding text files with cp1252 will be exactly right, and what the end user wanted. This is why I'm keen that by *default* Python should honour the UTF8 signature when reading files; particularly given that programmers who don't/can't/won't understand encodings are likely to read files without specifying an encoding and a lot of the time it will *seem* to work. That's probably a reasonable idea - but may also make things worse: on writing, you'd still use cp1252, so you may end up outputting the file in a different encoding. That would be particularly unfortunate if you were merely performing some simple text replacement. So whatever the API - there's always tradeoffs. Regards, Martin
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
So what is your naive programmer supposed to expect when writing a cat program? This may be a bit out of context - however, a simple cat program should open files in binary, and be done. (I am not sure whether the average naive programmer is able to grasp the notion of binary IO as opposed to text IO, and whether he would then be able to conclude that cat(1) is really about binary IO.) Regards, Martin
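A minimal sketch of the "open files in binary, and be done" approach follows. The function name cat() and the demo file names are illustrative; the point is only that copying bytes avoids choosing any codec, so files in different encodings pass through unchanged.

```python
import io
import os
import shutil
import sys
import tempfile


def cat(paths, out=None):
    """Concatenate files byte for byte; nothing is decoded or re-encoded."""
    if out is None:
        out = sys.stdout.buffer  # the binary layer underneath sys.stdout
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)


with tempfile.TemporaryDirectory() as tmp:
    latin1_path = os.path.join(tmp, "latin1.txt")
    utf8_path = os.path.join(tmp, "utf8.txt")
    with open(latin1_path, "wb") as f:
        f.write("caf\u00e9\n".encode("latin-1"))
    with open(utf8_path, "wb") as f:
        f.write("caf\u00e9\n".encode("utf-8"))
    sink = io.BytesIO()  # stands in for stdout in this demo
    cat([latin1_path, utf8_path], sink)
    result = sink.getvalue()
```

The output is a mix of two encodings, just as with cat(1); whether that is acceptable is the reader's problem, not the program's.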
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
On 24/01/2010 18:41, Martin v. Löwis wrote: However it is likely to be often wrong, and where the user's locale specifies an encoding like CP1252 then it will result in silent corruption rather than an immediate exception. Why do you say that? Why do you think it will likely be often wrong? Most likely, encoding text files with cp1252 will be exactly right, and what the end user wanted. If the file has a UTF-8 signature then decoding the file with CP1252 will almost always be wrong. I'm *not* suggesting switching to UTF8 by default, which we can't do as 3.1 stable is now out with the current behavior. This is why I'm keen that by *default* Python should honour the UTF8 signature when reading files; particularly given that programmers who don't/can't/won't understand encodings are likely to read files without specifying an encoding and a lot of the time it will *seem* to work. That's probably a reasonable idea - but may also make things worse: on writing, you'd still use cp1252, so you may end up outputting the file in a different encoding. That would be particularly unfortunate if you were merely performing some simple text replacement. Decoding a UTF-8 file with CP1252 will always succeed, but if it contains non-ascii characters then 'simple text replacement' will either not work or can corrupt the data. Reading as UTF-8 and then outputting as CP1252 (without data loss) is preferable in my opinion. If 'guessing' an encoding using the user's locale is acceptable then using another *very strong* indicator (i.e. the presence of the UTF8 signature) should also be acceptable. In addition there are many programs where the reading of data is separate from the writing of data (configuration files, xml etc) - so that the encoding of any files written is logically distinct. In my experience only a minority of programs have destructively rewritten their input files. 
If the programmer is never specifying an encoding but has an input file with a UTF8 signature, writing output in the locale specified encoding is the *right* thing to do. It may be different from the input encoding but it will be successfully read back in next time around. So whatever the API - there's always tradeoffs. Sure. I think the presence of a UTF-8 signature strongly enough indicates the encoding of the file to make it a better choice than using the locale preference. Only of course where an explicit encoding was not specified. Regards, Martin -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
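The "simple text replacement" disagreement can be sketched with in-memory bytes. The sample text is invented for illustration. One caveat that is an addition here, not something stated in the thread: CPython's cp1252 codec leaves a handful of byte values undefined, so decoding UTF-8 data as CP1252 usually succeeds, as it does below, but is not guaranteed to for every input.

```python
import codecs

# A UTF-8 file with a signature, as some editors write it.
original = codecs.BOM_UTF8 + "caf\u00e9 menu\n".encode("utf-8")

# Locale-style guess: decode as CP1252. No exception is raised, but the
# accented word becomes mojibake, so a search for it finds nothing.
as_cp1252 = original.decode("cp1252")
replaced_wrong = as_cp1252.replace("caf\u00e9", "th\u00e9")

# Honouring the signature: the replacement matches, and the result can
# still be written out in the locale's codec without data loss.
as_utf8 = original.decode("utf-8-sig")
replaced_right = as_utf8.replace("caf\u00e9", "th\u00e9")
output = replaced_right.encode("cp1252")
```

The written file is CP1252 rather than UTF-8, which is the change of encoding Martin warns about and Michael considers the lesser evil.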
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
On Sun, Jan 24, 2010 at 07:45:20PM +0100, Martin v. Löwis wrote: This may be a bit out of context - however, a simple cat program should open files in binary, and be done. (I am not sure whether the average naive programmer is able to grasp the notion of binary IO as opposed to text IO, and whether he would then be able to conclude that cat(1) is really about binary IO.) Depends on the kind of cat and especially on the ways of using it. If you ask cat to number lines (see manual for GNU cat) - what do lines mean for binary IO? Oleg. -- Oleg Broytman http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
I concede that I have no better statistics on the matter than you do, but I think that's wishful thinking. It is quite common for pure output to be mixed with echoed input, for example. Even if a file is converted to another format (eg, restructured text to LaTeX), it's very common for the text encoding to be preserved. Visual feedback related to text files typically includes fragments of the text. And so on. Please try to categorize Python applications. My bet is that the majority of Python applications written today do web stuff. In the web, input encoding and output encoding are fairly decorrelated - in particular for databases and files read from disk. You just can't get away from the need for explicit management of codecs if you want a robust internationalized application. I don't object to giving users an easy way to get the behavior Michael proposes; it just should not be the *default*. An easy way is pointless if it's not the default. The complicated way is to pass a parameter indicating what encoding you want to use. It's complicated not because it's difficult to use, but because you first need to grasp this entire unicode stuff. So if the easy way wasn't the default, you are lost with the error message you get, and the only word you recognize in it is unicode, which is, as far as you know, a synonym for hell. Regards, Martin
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Oleg Broytman phd at phd.pp.ru writes: Depends on the kind of cat and especially on the ways of using it. If you ask cat to number lines (see manual for GNU cat) - what do lines mean for binary IO? b"\n"-separated chunks of data. See the docs: http://docs.python.org/3.1/library/io.html#io.IOBase.readline Antoine.
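To show what "lines" means for binary IO, here is a small sketch of line numbering without any decoding. The function number_lines() is illustrative and only loosely imitates the layout of GNU cat -n; it relies on the documented behaviour that iterating a binary stream yields chunks ending in b"\n".

```python
import io


def number_lines(src, out):
    """Prefix each b"\\n"-terminated chunk of `src` with its line number."""
    for number, line in enumerate(src, start=1):
        prefix = ("%6d\t" % number).encode("ascii")
        out.write(prefix + line)  # the payload bytes are never decoded


# One Latin-1 line and one UTF-8 line in the same stream.
src = io.BytesIO(b"caf\xe9\nna\xc3\xafve\n")
out = io.BytesIO()
number_lines(src, out)
numbered = out.getvalue()
```

This works for any encoding in which the byte 0x0A only ever means a newline, which covers ASCII-compatible codecs such as UTF-8 and CP1252 but not, for example, UTF-16.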
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
On Sun, Jan 24, 2010 at 1:54 PM, Oleg Broytman p...@phd.pp.ru wrote: .. Depends on the kind of cat and especially on the ways of using it. If you ask cat to number lines (see manual for GNU cat) - what do lines mean for binary IO? Maybe this is yet another reason why some kinds of cat are a bad idea: "cat isn't for printing files with line numbers, it isn't for compressing multiple blank lines, it's not for looking at non-printing ASCII characters, it's for concatenating files." - Rob Pike, "UNIX Style, or cat -v Considered Harmful", USENIX Summer Conference Proceedings, 1983.
[Python-Dev] Python 2.5.5 Release Candidate 2
Subject: [ANN] Python 2.5.5 Release Candidate 2. On behalf of the Python development team and the Python community, I'm happy to announce release candidate 2 of Python 2.5.5. This is a source-only release that only includes security fixes. The last full bug-fix release of Python 2.5 was Python 2.5.4. Users are encouraged to upgrade to the latest release of Python 2.6 (which is 2.6.4 at this point). This release fixes issues with the logging and tarfile modules, and with thread-local variables. Since release candidate 1, additional bugs have been fixed in the expat module. See the detailed release notes at the website (also available as Misc/NEWS in the source distribution) for details of bugs fixed. For more information on Python 2.5.5, including download links for various platforms, release notes, and known issues, please see: http://www.python.org/2.5.5 Highlights of the previous major Python releases are available from the Python 2.5 page, at http://www.python.org/2.5/highlights.html Enjoy this release, Martin Martin v. Loewis mar...@v.loewis.de Python Release Manager (on behalf of the entire python-dev team)
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Stephen J. Turnbull wrote: You just can't get away from the need for explicit management of codecs if you want a robust internationalized application. I don't object to giving users an easy way to get the behavior Michael proposes; it just should not be the *default*. Using any guessing based on the locale (which describes the codec used by the user's console, but is completely uncorrelated to any particular file on the user's filesystem) is just about guaranteed to fail for lots of users. Any guessing at all should have to be enabled by the application: the library doesn't have enough information to make a non-data-mangling guess in some of those cases. Opening a file is one of those places where people need to think about the bytes vs. text problem: we can't make that go away by playing whack-a-mole with the edge cases. Tres. -- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Antoine Pitrou writes: Stephen J. Turnbull stephen at xemacs.org writes: But it *does* determine the charset of ErrorDocuments displayed by Apache. Users are likely to get somewhat confused if the ErrorDocuments are in a different charset from your dynamic HTML. Why would they? The browser picks the encoding from either the HTTP headers or the HTML meta tag; these don't have to be the same for every document served by the same domain. Don't ask me why; I just know that my experience is that mojibake happens on some Japanese sites with the default configuration of Firefox 3.5 or 3.6. Perhaps it's a bug in Firefox, but I think it's more likely that folks are setting default charsets incompatibly with ErrorDocuments. Either way, it happens. The point that you're avoiding is that in fact ErrorDocument literals *do* pick up their charsets from the config file, and therefore that charset cannot be decorrelated with the output charset in some circumstances.
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Martin v. Löwis writes: My bet is that the majority of Python applications written today do web stuff. In the web, input encoding and output encoding are fairly decorrelated - in particular for databases and files read from disk. Sure. Which means that programmers have to do a lot of explicit codec management anyway. If you hide output codec management in libraries and provide convenient defaults for input codecs, the end result is intermittent mojibake that's hard to fix. Especially if the output gets saved to disk and the input thrown away, as is sometimes the case. You just can't get away from the need for explicit management of codecs if you want a robust internationalized application. I don't object to giving users an easy way to get the behavior Michael proposes; it just should not be the *default*. An easy way is pointless if it's not the default. Sure, but that default should be set by the site, or in some cases by the application as Tres Seaver suggests, not by the Python source distribution. get, and the only word you recognize in it is unicode, which is, as far as you know, a synonym for hell. Welcome to Hell^H^H^H^Hthe Hotel Internet. You can check out, but you can never leave. In a multilingual environment, you have three choices: code everything in one universal coded character set, or manage codecs explicitly and associate a character set to each body of content, or guess and accept more or less frequent mojibake (and put off the day where you choose one of the sane alternatives until it costs five times as much). That last choice should not be the default, however much the users demand it. The first choice is a much better (more Pythonic) default: - UTF-8 is the one obvious way to do it. It's portable to all interesting platforms and the default on many of them. 
It is sufficient for almost all purposes (admittedly it may be costly to convert legacy content from its original coded character set, but in that case the explicit management option is usually viable), and it is well-supported by Python. - Refusing to guess is easy to document, and easy to debug. I see no great benefit to guessing to override the Zen. Note that Michael is correct: in the presence of the UTF-8 signature, for practical purposes you're not guessing. But that's only half the story: if behavior is *different* when there is *no* signature, then in those cases there is ambiguity and you *are* guessing.
Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)
Using any guessing based on the locale (which describes the codec used by the user's console, but is completely uncorrelated to any particular file on the user's filesystem) No, it's not just the encoding of the console. It is also the encoding that text editors will use, in the absence of a more specific direction. Any guessing at all should have to be enabled by the application: the library doesn't have enough information to make a non-data-mangling guess in some of those cases. Opening a file is one of those places where people need to think about the bytes vs. text problem: we can't make that go away by playing whack-a-mole with the edge cases. Many developers are completely unable to make that choice, as Python 2 has demonstrated. Regards, Martin