I declare this thread irreparably broken. Do not make any decisions in this thread. Tell me (in another thread) when it's time to decide and I will.

On Sat, Aug 23, 2014 at 8:27 PM, Nick Coghlan <ncogh...@gmail.com> wrote:

> On 24 August 2014 04:37, Oleg Broytman <p...@phdru.name> wrote:
> > On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore
> > <p.f.mo...@gmail.com> wrote:
> >> Generally, it seems to be mostly a reaction to the repeated claims
> >> that Python, or Windows, or whatever, is "broken".
> >
> > Ah, if that's the only problem I certainly can live with that. My
> > problem is that it *seems* this anti-Unix attitude infiltrates Python
> > core development. I very much hope I'm wrong and it really isn't.
>
> The POSIX locale based approach to handling encodings is genuinely
> broken - it's almost as broken as code pages are on Windows. The
> fundamental flaw is that locales encourage *bilingual* computing:
> handling English plus one other language correctly. Given a global
> internet, bilingual computing *is a fundamentally broken approach*. We
> need multilingual computing (any human language, all the time), and
> that means Unicode.
>
> As some examples of where bilingual computing breaks down:
>
> * My NFS client and server may have different locale settings
> * My FTP client and server may have different locale settings
> * My SSH client and server may have different locale settings
> * I save a file locally and send it to someone with a different locale
>   setting
> * I attempt to access a Windows share from a Linux client (or vice-versa)
> * I clone my POSIX hosted git or Mercurial repository on a Windows client
> * I have to connect my Linux client to a Windows Active Directory
>   domain (or vice-versa)
> * I have to interoperate between native code and JVM code
>
> The entire computing industry is currently struggling with this
> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
> encoding/code pages) -> multilingual (Unicode) transition. It's been
> going on for decades, and it's still going to be quite some time
> before we're done.
>
> The POSIX world is slowly clawing its way towards a multilingual model
> that actually works: UTF-8
> Windows (including the CLR) and the JVM adopted a different
> multilingual model, but still one that actually works: UTF-16-LE
>
> POSIX is hampered by legacy ASCII defaults in various subsystems (most
> notably the default locale) and the assumption that system metadata is
> "just bytes" (an assumption that breaks down as soon as you have to
> hand that metadata over to another machine that may have different
> locale settings)
> Windows is hampered by the fact they kept the old 8-bit APIs around
> for backwards compatibility purposes, so applications using those APIs
> are still only bilingual (at best) rather than multilingual.
> JVM and CLR applications will at least handle the Basic Multilingual
> Plane (UCS-2) correctly, but may not correctly handle code points
> beyond the 16-bit boundary (this is the "Python narrow builds don't
> handle Unicode correctly" problem that was resolved for Python 3.3+ by
> PEP 393)
>
> Individual users (including some organisations) may have the luxury of
> saying "well, all my clients and all my servers are POSIX, so I don't
> care about interoperability with other platforms". As the providers of
> a cross-platform runtime environment, we don't have that luxury - we
> need to figure out how to get *all* the major platforms playing nice
> with each other, regardless of whether they chose UTF-8 or UTF-16-LE
> as the basis for their approach towards providing multilingual
> computing environments.
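
(A small illustration of the 16-bit boundary mentioned in the quoted text. This is a sketch added for clarity and is not from Nick's message; the helper name utf16_code_units and the sample characters are made up for the example. It assumes Python 3.3 or later, where PEP 393 applies.)

```python
# Illustrative sketch only: compare a character inside the Basic
# Multilingual Plane with one beyond the 16-bit boundary.
bmp_char = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE (inside the BMP)
astral_char = "\U0001F40D"  # SNAKE (beyond the BMP)


def utf16_code_units(text):
    """Hypothetical helper: count the 16-bit code units that the
    UTF-16-LE encoding of `text` needs."""
    return len(text.encode("utf-16-le")) // 2


for ch in (bmp_char, astral_char):
    # len() counts code points on Python 3.3+, so it reports 1 for both;
    # the UTF-16 form needs a surrogate pair for the second character.
    print(hex(ord(ch)), len(ch), utf16_code_units(ch))
```

Code that assumes one 16-bit unit per character works for the first character and miscounts the second, which is the kind of failure described for UCS-2-style handling.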
>
> Historically, that question of cross platform interoperability for
> open source software has been handled in a few different ways:
>
> * Don't really interoperate with anybody, reinvent all the wheels (the
>   JVM way)
> * Emulate POSIX on Windows (the Cygwin/MinGW way)
> * Let the application developer figure it out (the Python 2 way)
>
> The first approach is inordinately expensive - it took the resources
> of Sun in its heyday to make it possible, and it effectively locks the
> JVM out of certain kinds of computing (e.g. it's hard to do array
> oriented programming in JVM languages, because the CPU and GPU
> vectorisation features aren't readily accessible).
>
> The second approach prevents the creation of truly native Windows
> applications, which makes it uncompelling as a way of attracting
> Windows users - it sends a clear signal that the project doesn't
> *really* care about supporting Windows as a platform, but instead only
> grudgingly accepts that there are Windows users out there that might
> like to use their software.
>
> The third approach is the one we tried for a long time with Python 2,
> and essentially found to be an "experts only" solution. Yes, you can
> *make* it work, but the runtime isn't set up so it works *by default*.
>
> The Unicode changes in Python 3 are a result of the Python core
> development team saying "it really shouldn't be this hard for
> application developers to get cross-platform interoperability between
> correctly configured systems when dealing solely with correctly
> encoded data and metadata". The idea of Python 3 is that applications
> should require additional complexity solely to deal with *incorrectly*
> configured systems and improperly encoded data and metadata (and,
> ideally, the detection of the need for such handling should be "Python
> 3 threw an exception" rather than "something further down the line
> detected corrupted data").
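
(A sketch of the difference between "Python 3 threw an exception" and "something further down the line detected corrupted data". It is added for illustration and is not from Nick's message; the sample bytes and the helper name decode_strict are assumptions. The surrogateescape error handler is the one the quoted text refers to later.)

```python
# Illustrative sketch only: bytes written under a Latin-1 locale and then
# read by a program that expects UTF-8.
raw = "caf\u00e9".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8


def decode_strict(data):
    """Hypothetical helper: strict UTF-8 decoding, reporting failures."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # The problem surfaces right here, at the boundary.
        return "rejected: " + exc.reason


print(decode_strict(raw))

# The opposite mistake fails silently: UTF-8 bytes decoded as Latin-1
# produce mojibake that only gets noticed further down the line.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1")
print(ascii(mojibake))

# Opting in to surrogateescape smuggles the undecodable byte through as a
# lone surrogate, so the original bytes can be recovered unchanged.
smuggled = raw.decode("utf-8", errors="surrogateescape")
print(ascii(smuggled))
assert smuggled.encode("utf-8", errors="surrogateescape") == raw
```

The strict path needs no extra code when the data is correctly encoded; the surrogateescape path is the additional complexity an application takes on only when it has to tolerate badly encoded input.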
>
> This is software rather than magic, though - these improvements only
> happen through people actually knuckling down and solving the related
> problems. When folks complain about Python 3's operating system
> interface handling causing problems in some situations? They're almost
> always referring to areas where we're still relying on the locale
> system on POSIX or the code page system on Windows. Both of those
> approaches are irredeemably broken - the answer is to stop relying on
> them, but appropriately updating the affected subsystems generally
> isn't a trivial task. A lot of the affected code runs before the
> interpreter is fully initialised, which makes it really hard to test,
> and a lot of it is incredibly convoluted due to various configuration
> options and platform specific details, which makes it incredibly hard
> to modify without breaking anything.
>
> One of those areas is the fact that we still use the old 8-bit APIs to
> interact with the Windows console. Those are just as broken in a
> multilingual world as the other Windows 8-bit APIs, so Drekin came up
> with a project to expose the Windows console as a UTF-16-LE stream
> that uses the 16-bit APIs instead:
> https://pypi.python.org/pypi/win_unicode_console
>
> I personally hope we'll be able to get the issues Drekin references
> there resolved for Python 3.5 - if other folks hope for the same
> thing, then one of the best ways to help that happen is to try out the
> win_unicode_console module and provide feedback on what does and
> doesn't work.
>
> Another was getting exceptions attempting to write OS data to
> sys.stdout when the locale settings had been scrubbed from the
> environment. For Python 3.5, we better tolerate that situation by
> setting "errors=surrogateescape" on sys.stdout when the environment
> claims "ascii" as a suitable encoding for talking to the operating
> system (this is our way of saying "we don't actually believe you, but
> also don't have the data we need to overrule you completely").
>
> While I was going to wait for more feedback from Fedora folks before
> pushing the idea again, this thread also makes me think it would be
> worth our while to add more tools for dealing with surrogate escapes
> and latin-1 binary data smuggling just to help make those techniques
> more discoverable and accessible:
> http://bugs.python.org/issue18814#msg225791
>
> These various discussions are also giving me plenty of motivation to
> get back to working on PEP 432 (the rewrite of the interpreter startup
> sequence) for Python 3.5. A lot of these things are just plain hard to
> change because of the complexity of the current startup code.
> Redesigning that to use a cleaner, multiphase startup sequence that
> gets the core interpreter running *before* configuring the operating
> system integration should give us several more options when it comes
> to dealing with some of these challenges.
>
> Regards,
> Nick.
>
> --
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/guido%40python.org
>

--
--Guido van Rossum (python.org/~guido)