Re: [Python-Dev] Python and the Unicode Character Database

Terry Reedy Mon, 29 Nov 2010 11:27:00 -0800

On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:

Nick Coghlan wrote:

On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg <[email protected]> wrote:

If we would go down that road, we would also have to disable other
Unicode features based on locale, e.g. whether to apply non-ASCII
case mappings, what to consider whitespace, etc.


We don't do that for a good reason: Unicode is supposed to be
universal and not limited to a single locale.


Because parsing numbers is about more than just the characters used
for the individual digits. There are additional semantics associated
with digit ordering (for any number) and decimal separators and
exponential notation (for floating point numbers) and those vary by
locale. We deliberately chose to make the builtin numeric parsers
unaware of all of those things, and assuming that we can simply parse
other digits as if they were their ASCII equivalents and otherwise
assume a C locale seems questionable.


Sure, and those additional semantics are locale dependent, even
between ASCII-only locales. However, that does not apply to the
basic building blocks, the decimal digits themselves.

If the existing semantics can be adequately defined, documented and
defended, then retaining them would be fine. However, the language
reference needs to define the behaviour properly so that other
implementations know what they need to support and what can be chalked
up as being just an implementation accident of CPython. (As a point in
the plus column, both decimal.Decimal and fractions.Fraction were able
to handle the '١٢٣٤.٥٦' example in a manner consistent with the int
and float handling)
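
For readers who want to reproduce the behaviour mentioned above, here is a minimal sketch, assuming a Python 3 interpreter. The variable names are illustrative only, and the string is written with escapes so that the ARABIC-INDIC digits survive copy and paste.

```python
from decimal import Decimal
from fractions import Fraction

# '١٢٣٤.٥٦' spelled with escapes: ARABIC-INDIC DIGIT ONE..SIX plus an ASCII '.'
text = "\u0661\u0662\u0663\u0664.\u0665\u0666"
integer_part = "\u0661\u0662\u0663\u0664"

# Each constructor maps the Nd digits to their decimal values; the
# decimal separator is still the ASCII full stop, as in the C locale.
as_int = int(integer_part)
as_float = float(text)
as_decimal = Decimal(text)
as_fraction = Fraction(text)

print(as_int, as_float, as_decimal, as_fraction)
```

Note that only the digits are interpreted; nothing here is locale-aware, which is the distinction the thread is arguing about.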


The support is built into the C API, so there's not really much
surprise there.

Regarding documentation, we'd just have to add that numbers may
be made up of Unicode code points in the category "Nd".

See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section
4.6 for details....

"""
Decimal digits form a large subcategory of numbers consisting of those digits
that can be used to form decimal-radix numbers. They include script-specific
digits, but exclude characters such as Roman numerals and Greek acrophonic
numerals. (Note that <1, 5> = 15 = fifteen, but <I, V> = IV = four.) Decimal
digits also exclude the compatibility subscript or superscript digits to
prevent simplistic parsers from misinterpreting their values in context.
"""

int(), float() and long() (in Python 2) are such simplistic
parsers.
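
The "Nd" distinction can be checked from Python itself. The following is a rough sketch, assuming Python 3 and the standard unicodedata module; the sample characters and the helper parses_as_int are chosen here for illustration and are not taken from the thread.

```python
import unicodedata

# Hypothetical sample set: two decimal digits and two numeric
# characters that the Unicode standard keeps out of category Nd.
samples = {
    "ASCII digit five": "5",
    "Arabic-Indic digit five": "\u0665",
    "superscript two": "\u00b2",
    "Roman numeral four": "\u2163",
}


def parses_as_int(ch):
    """Return True if int() accepts the single character ch."""
    try:
        int(ch)
        return True
    except ValueError:
        return False


categories = {label: unicodedata.category(ch) for label, ch in samples.items()}
accepted = {label: parses_as_int(ch) for label, ch in samples.items()}

for label in samples:
    print(label, categories[label], accepted[label])
```

Both digit fives report category Nd and are accepted by int(); the superscript (No) and the Roman numeral (Nl) are rejected with ValueError.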

Since you are the knowledgeable advocate of the current behavior, perhaps you could open an issue and propose a doc patch, even if not .rst formatted.


--
Terry Jan Reedy


