On Tue, Nov 23, 2010 at 2:18 PM, Amaury Forgeot d'Arc <amaur...@gmail.com> wrote:
..
>> Given the apparent difficulty of writing even basic text processing
>> algorithms in presence of surrogate pairs, I wonder how wise it is to
>> expose Python users to them.
>
> This was already discussed two years ago:
>
> http://mail.python.org/pipermail/python-dev/2008-July/080900.html
Thanks for the link. Let me summarize that discussion as I read it.

The discussion starts with a reference to Guido's 2001 post, which concluded with

"""
... if we had wanted to use a variable-length internal representation, we should have picked UTF-8 way back, like Perl did. Moving to a UTF-16-based internal representation now will give us all the problems of the Perl choice without any of the benefits.
""" [1]

and proposes moving completely to UCS-4 for Python 3.0. Note that this is not the option I would like to discuss here: I am not proposing to abandon narrow builds. Instead, I would like to discuss the costs and benefits of using a variable-width CES as the internal representation, which is where the 2008 discussion moved.

The OP did not realize that narrow builds supported UTF-16 and, like myself, was surprised that application developers must be aware of surrogates if they want to use narrow builds. It was also suggested that Python itself is likely to have many bugs that can be triggered by non-BMP characters on narrow builds. Guido's response was:

"""
I'd also prefer to receive bug reports about breakages actually encountered in the wild than purely theoretical issues
"""

I don't think this is a good position to take. Programs that expect one code unit where Python may produce two are likely to have security holes. Even when programmers carefully sanitize their input, they are likely to do it at the code point level based on Unicode category, and the 0xFFFF boundary does not mean anything special for their applications. I think anyone who wants to write a robust application has two choices in practice: (a) use a wide Unicode build, or (b) restrict all text to the BMP. Supporting surrogates at the application level is likely to be prohibitively expensive.

It was later suggested that the main benefit of "UTF-16" builds is that they can easily interface with system libraries that are "UTF-16" based. However, how likely are those libraries to be bug-free when it comes to non-BMP characters? History teaches us: not very. Daniel Arbuckle presented arguments against imposing the burden of dealing with surrogates on application writers. [2]

The recurrent theme of the thread was that non-BMP characters are rare, and those who need them can afford the extra development cost associated with surrogates. This point was very eloquently articulated by Guido:

"""
Who are the many here? Who are the few? I'd venture that (at least for the foreseeable future, say, until China will finally have taken over the role of the US as the de-facto dominant super power :-) the many are people whose app will never see a Unicode character outside the BMP, or who do such minimal string processing that their code doesn't care whether it's handling UTF-16-encoded data.
""" [3]

This argument can also be used to support the position that narrow builds should not support non-BMP characters.

Later the discussion started resembling this thread, when it went into a scholastic dispute over fine points of Unicode Standard terminology. :-)

Then the BDFL vetoed making len(u"\U00012345") return 1 on narrow builds. [4] I would be against that as well. I don't see len(u"\U00012345") == 2 as a big problem, because application developers can simply avoid \U literals if they don't want to support non-BMP characters. An option to warn users about non-BMP literals on a narrow build might be useful, but that is easy to implement in lint-like tools.
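To make the code-unit vs. code-point mismatch concrete, here is a minimal demonstration (the print statements assume a Python 2.x interpreter, but the lengths come out the same on a 3.x narrow build):

    import sys

    s = u"\U00012345"  # a non-BMP character

    if sys.maxunicode == 0xFFFF:
        # Narrow (UTF-16) build: the literal is stored as a surrogate pair.
        print len(s)                  # 2
        print repr(s[0]), repr(s[1])  # u'\ud808' u'\udf45'
    else:
        # Wide (UCS-4) build: one code unit per code point.
        print len(s)                  # 1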
There were multiple suggestions for standard library additions to help application writers deal with surrogate pairs, but as far as I can tell, nothing has been done in this area in the two years since. I don't think there is a recipe for fixing a legacy character-by-character processing loop such as

    for c in string:
        ...

so that it iterates over code points consistently in wide and narrow builds. (Note that I am not asking for a grapheme iterator here; that is clearly an application-level feature. A rough sketch of a code point iterator follows the references below.)

> So yes, wrap() and center() should be fixed.

I opened issue 10521 for that. [5] I am fully prepared to see it dismissed as "theoretical" and closed as "won't fix", or left to linger indefinitely. Fixing it would most likely involve writing a second version of the pad() utility function specifically for the narrow build. All the examples of dealing with surrogates that I've seen in Python's C code came with hand-coded #ifndef Py_UNICODE_WIDE fragments and no user-friendly macros or APIs to abstract them away. A quick grep for maxunicode in the standard library revealed only one case of "narrow-build aware" code:

    if sys.maxunicode != 65535:
        # XXX: negation does not work with big charsets
        return charset

See Lib/sre_compile.py. Not exactly a model to follow.

To conclude, I feel that rather than trying to fully support non-BMP characters as surrogate pairs in narrow builds, we should make it easier for application developers to avoid them. If abandoning the internal use of UTF-16 is not an option, I think we should at least add an option for decoders that currently produce surrogate pairs to treat non-BMP characters as errors and handle them according to the user's choice.

[1] http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html
[2] http://mail.python.org/pipermail/python-dev/2008-July/080912.html
[3] http://mail.python.org/pipermail/python-dev/2008-July/080940.html
[4] http://mail.python.org/pipermail/python-dev/2008-July/080916.html
[5] http://bugs.python.org/issue10521
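As promised above, here is a rough sketch of such a code point iterator; iter_code_points is a hypothetical name, not an existing or proposed stdlib API:

    import sys

    def iter_code_points(s):
        # Yield one string per code point.  On a wide build every code
        # unit already is a code point; on a narrow build a high
        # surrogate is joined with the low surrogate that follows it,
        # so a non-BMP code point is yielded as a two-code-unit string.
        if sys.maxunicode > 0xFFFF:  # wide build
            for c in s:
                yield c
            return
        i, n = 0, len(s)
        while i < n:
            c = s[i]
            if u'\ud800' <= c <= u'\udbff' and i + 1 < n:
                c2 = s[i + 1]
                if u'\udc00' <= c2 <= u'\udfff':
                    yield c + c2  # one non-BMP code point
                    i += 2
                    continue
            yield c  # BMP code point (or a lone surrogate)
            i += 1

With this, the legacy loop becomes "for c in iter_code_points(string): ..." and sees the same sequence of code points on either build. A similar helper that scans decoded text for surrogates could serve as a stopgap for the decoder option suggested above.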