On 11/23/2010 2:11 PM, Alexander Belopolsky wrote:

This discussion motivated me to start looking into how well the Python
library itself is prepared to deal with len(chr(i)) = 2.  I was not

Good idea!

surprised to find that textwrap does not handle the issue that well:

>>> len(wrap(' \U00010140' * 80, 20))
12
>>> len(wrap(' \U00000140' * 80, 20))
8

How well does textwrap handle composable pairs (letter + accent)? Does it count two code points as one character space, and does it avoid putting a line break between them? I suspect textwrap should be regarded as (extended?)_ascii_textwrap.
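
A rough way to check this on a given build is sketched below. The sample text is made up for illustration: the same ten words are wrapped twice, once with precomposed é (U+00E9) and once with 'e' followed by COMBINING ACUTE ACCENT (U+0301). If wrap() counts code points rather than printed columns, the decomposed text should need more lines even though both versions render the same.

```python
import textwrap
import unicodedata

# 'e' + COMBINING ACUTE ACCENT: two code points, one printed character.
decomposed = "e\u0301"
# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Ten words of four visible characters each, separated by single spaces.
text_decomposed = " ".join([decomposed * 4] * 10)
text_composed = " ".join([composed * 4] * 10)

lines_decomposed = textwrap.wrap(text_decomposed, 20)
lines_composed = textwrap.wrap(text_composed, 20)

# Both texts look identical when printed, but the line counts differ
# because the width limit is applied to code points, not columns.
print(len(lines_composed), len(lines_decomposed))
```

Normalizing to NFC before wrapping only helps where a precomposed form exists, so it is a workaround rather than a fix.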

That module should probably be rewritten to properly implement the
Unicode line breaking algorithm
<http://unicode.org/reports/tr14/tr14-22.html>.

Probably a good idea

Yet finding a bug in a str object method after a 5 min review was a
bit discouraging:

>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long

Again, what does it do with letter + decorator combinations? It seems to me that the whole notion that one code point == one printed character space is broken once one leaves ASCII. Perhaps we need an is_uchar function to recognize multi-code-point sequences, including surrogate pairs, that represent one character for the purposes of character-oriented functions.
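
To make the is_uchar idea concrete, here is a minimal sketch. The name split_uchars and its grouping rule are hypothetical, not an existing stdlib API: it attaches combining marks to the preceding base character and joins a high surrogate with the low surrogate that follows it. It is a simplification and does not implement the full Unicode grapheme cluster rules.

```python
import unicodedata


def split_uchars(s):
    """Group code points into approximate user-perceived characters.

    Hypothetical helper: a code point joins the previous group if it is
    a combining mark, or if it is a low surrogate directly following a
    high surrogate.
    """
    groups = []
    for ch in s:
        is_low = 0xDC00 <= ord(ch) <= 0xDFFF
        after_high = bool(groups) and 0xD800 <= ord(groups[-1][-1]) <= 0xDBFF
        if groups and (unicodedata.combining(ch) or (is_low and after_high)):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# 'e' + combining acute accent is kept together; 'x' stands alone.
print(len(split_uchars("e\u0301x")))
```

Character-oriented functions such as center() or wrap() could then count and split on these groups instead of on individual code points.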

Given the apparent difficulty of writing even basic text processing
algorithms in the presence of surrogate pairs, I wonder how wise it is to
expose Python users to them.  As Wikipedia explains [1]:

"""
Because the most commonly used characters are all in the Basic
Multilingual Plane, converting between surrogate pairs and the
original values is often not tested thoroughly. This leads to
persistent bugs, and potential security holes, even in popular and
well-reviewed application software.
"""

So we did not test thoroughly enough and need to add appropriate unit tests as bugs are fixed.
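
As a sketch of the kind of test I mean (the class and method names are made up, and this is not taken from the existing test suite), a regression test for the center() case could look like this. It skips itself on a build where the fill character is stored as a surrogate pair, since that is exactly the case that fails today.

```python
import unittest


class NonBMPFillTest(unittest.TestCase):
    """Hypothetical regression test for str.center with a non-BMP fill."""

    def test_center_with_non_bmp_fill(self):
        fill = "\U00010140"
        if len(fill) != 1:
            # On a build that stores this as a surrogate pair,
            # center() currently rejects the fill character.
            self.skipTest("non-BMP character is not a single code point here")
        result = "xyz".center(20, fill)
        # 20 columns minus the 3 of 'xyz' leaves 17 fill characters.
        self.assertEqual(result.count(fill), 17)
        self.assertIn("xyz", result)


# Build a suite so the test can be run programmatically.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NonBMPFillTest)
```

The suite can be run with unittest.TextTestRunner().run(suite), or the module can be picked up by python -m unittest.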


--
Terry Jan Reedy
