New submission from Serhiy Storchaka:
This issue is inspired by the tweet
https://twitter.com/dabeaz/status/925787482515533830
>>> import sys
>>> a = 'n'
>>> b = 'ñ'
>>> sys.getsizeof(a)
50
>>> sys.getsizeof(b)
74
>>> float(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'ñ'
>>> sys.getsizeof(b)
77
See also a discussion on Python-list
(https://mail.python.org/pipermail/python-list/2017-November/728149.html) and
Stack Overflow
(https://stackoverflow.com/questions/47062184/why-does-the-size-of-this-python-string-change-on-a-failed-int-conversion).
Converting a non-ASCII string to a number fails when the string contains
non-ASCII characters other than decimal digits and spaces, but the attempt
still changes the size of the input string, because an internal UTF-8
representation is created and cached on it.
This can look surprising to beginners, but there is nothing wrong here. There
are many cases in which an internal UTF-8 representation is created implicitly.
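
The effect can be observed with a small helper that measures a string before and after a conversion attempt. This is only a sketch: size_before_and_after is a hypothetical name, not part of any library, and the reported numbers depend on the CPython version and build. On an interpreter where the conversion does not create the UTF-8 representation, the two sizes are simply equal.

```python
import sys

def size_before_and_after(text, convert=float):
    """Return (size_before, size_after) around a conversion attempt on text."""
    before = sys.getsizeof(text)
    try:
        convert(text)
    except ValueError:
        # The conversion is expected to fail for text such as 'ñ'.
        # UnicodeEncodeError is a subclass of ValueError, so it is caught too.
        pass
    after = sys.getsizeof(text)
    return before, after

before, after = size_before_and_after('ñ')
print(before, after)  # sizes vary by Python version and build
```
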
But looking at the code, I have found that it is too complicated. The parsers
for int, float and complex call _PyUnicode_TransformDecimalAndSpaceToASCII(),
which transforms non-ASCII decimal digits and spaces to ASCII. This function
uses the general function fixup(), which takes a transformation function,
creates a new string object, and applies the transformation. It then checks
whether the transformation produced a string with a larger maximal character
code than the original string; in that case it creates a new string object and
repeats the transformation. Finally, if the resulting string is equal to the
original string, it destroys the resulting string and returns the original.
In the past fixup() was used for implementing methods like upper(). But now
_PyUnicode_TransformDecimalAndSpaceToASCII() is the only user of fixup(), and
it doesn't need all the complicated logic of fixup(). For example, this
transformation never produces wider strings.
The proposed PR simplifies the code by getting rid of fixup() and
fix_decimal_and_space_to_ascii(). The semantics of
_PyUnicode_TransformDecimalAndSpaceToASCII() have been changed. It now always
produces an ASCII string (this also simplifies the call sites). If a non-ASCII
character that is neither a decimal digit nor a space is encountered, the
rest of the string is replaced with '?', which will cause an error in the parsers.
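
The new semantics can be modelled in pure Python, as a rough sketch rather than the actual C implementation. The function name transform_decimal_and_space_to_ascii is hypothetical, and collapsing the remainder of the string into a single '?' is an assumption about what "replaced with '?'" means; the C code may differ in that detail.

```python
import unicodedata

def transform_decimal_and_space_to_ascii(text):
    """Pure-Python model of the new semantics described above (not the C code)."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)              # ASCII passes through unchanged
        elif ch.isspace():
            out.append(' ')             # any Unicode whitespace becomes ' '
        else:
            digit = unicodedata.decimal(ch, None)
            if digit is None:
                # Assumption: the remainder collapses into a single '?'.
                out.append('?')
                break
            out.append(str(digit))      # e.g. ARABIC-INDIC DIGIT ONE -> '1'
    return ''.join(out)

print(transform_decimal_and_space_to_ascii('\u0661\u0662\u0663'))  # Arabic-Indic digits
print(transform_decimal_and_space_to_ascii('1\xf12'))              # 'ñ' is not a digit
```

Because the result is always ASCII, a parser that receives it only has to reject the '?' like any other malformed input, which is why the failure surfaces as an ordinary ValueError.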
The only visible behavior change (apart from no longer changing the size of the
original string) is that the exception raised by float() and complex() when the
string contains lone surrogates changes from UnicodeEncodeError to ValueError
(the same as for other malformed strings). int() already contained a special
case for this and raised a ValueError.
Unpatched:
>>> int('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\ud800'
>>> float('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> complex('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
Patched:
>>> int('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\ud800'
>>> float('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '\ud800'
>>> complex('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: complex() arg is a malformed string
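
Since UnicodeEncodeError is a subclass of ValueError, code that catches ValueError should behave the same with and without the patch; only the concrete exception type differs. The following sketch checks this for all three constructors. The helper name conversion_error is hypothetical, and the type names printed depend on the interpreter in use.

```python
def conversion_error(convert, text):
    """Return the exception raised by convert(text), or None if it succeeds."""
    try:
        convert(text)
    except ValueError as exc:
        # Catches both ValueError and its subclass UnicodeEncodeError.
        return exc
    return None

for convert in (int, float, complex):
    exc = conversion_error(convert, '\ud800')
    # Print only the type name; the message contains a lone surrogate,
    # which may not be encodable on the output stream.
    print(convert.__name__, type(exc).__name__)
```
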
This PR saves around 80 lines of code.
----------
assignee: serhiy.storchaka
components: Unicode
messages: 305824
nosy: ezio.melotti, haypo, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Simplify converting non-ASCII strings to int, float and complex
type: enhancement
versions: Python 3.7
_______________________________________
Python tracker
<https://bugs.python.org/issue31979>
_______________________________________