Julien added the comment:

To dig further, the DIGIT_MASK and DECIMAL_MASK used in `unicodeobject.c` are 
from `unicodectype.c` and they match values from `unicodetype_db.h`, which is 
generated by `Tools/unicode/makeunicodedata.py`, which builds those masks this 
way:

    # decimal digit, integer digit
    decimal = 0
    if record[6]:
        flags |= DECIMAL_MASK
        decimal = int(record[6])
    digit = 0
    if record[7]:
        flags |= DIGIT_MASK
        digit = int(record[7])
    if record[8]:
        flags |= NUMERIC_MASK
        numeric.setdefault(record[8], []).append(char)
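
To make the effect of that excerpt concrete, here is a minimal standalone sketch of the same flag logic. The mask bit values and the helper name `numeric_flags` are assumptions for illustration only (the real constants live in the CPython sources), and the two sample lines are typed out in the semicolon-separated layout of UnicodeData.txt rather than read from the real file:

```python
# Illustrative bit values only (assumption); the real constants are
# defined in the CPython sources.
DECIMAL_MASK = 0x02
DIGIT_MASK = 0x04
NUMERIC_MASK = 0x800


def numeric_flags(line):
    """Hypothetical helper: compute the three numeric flags for one
    semicolon-separated UnicodeData.txt line."""
    record = line.split(";")
    flags = 0
    if record[6]:  # field 6: decimal digit value
        flags |= DECIMAL_MASK
    if record[7]:  # field 7: digit value
        flags |= DIGIT_MASK
    if record[8]:  # field 8: numeric value
        flags |= NUMERIC_MASK
    return flags


# Sample lines typed out by hand in the UnicodeData.txt layout.
ARABIC_ZERO = "0660;ARABIC-INDIC DIGIT ZERO;Nd;0;AN;;0;0;0;N;;;;;"
SUPERSCRIPT_TWO = (
    "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;"
)

for sample in (ARABIC_ZERO, SUPERSCRIPT_TWO):
    print(sample.split(";")[1], hex(numeric_flags(sample)))
```

With these inputs, ARABIC-INDIC DIGIT ZERO gets all three flags, while SUPERSCRIPT TWO gets only the digit and numeric flags because its field 6 is empty.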

Those "record"s are documented in 
ftp://unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html in which fields 6, 
7, and 8 are:

 - 6    Decimal digit value     N       This is a numeric field. If the 
character has the decimal digit property, as specified in Chapter 4 of the 
Unicode Standard, the value of that digit is represented with an integer value 
in this field

 - 7    Digit value     N       This is a numeric field. If the character 
represents a digit, not necessarily a decimal digit, the value is here. This 
covers digits which do not form decimal radix forms, such as the compatibility 
superscript digits

 - 8    Numeric value   N       This is a numeric field. If the character has 
the numeric property, as specified in Chapter 4 of the Unicode Standard, the 
value of that character is represented with an integer or rational number in 
this field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR FRACTION 
ONE FIFTH. Also included are numerical values for compatibility characters such 
as circled numbers.
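
As a quick way to inspect these three fields from Python, the standard `unicodedata` module exposes them as `decimal()`, `digit()` and `numeric()`. The sketch below is only illustrative: the helper name `numeric_fields` is made up here, and the module reflects whatever Unicode version ships with the interpreter, not necessarily the 3.2 file linked above.

```python
import unicodedata


def numeric_fields(char):
    """Hypothetical helper: return (decimal, digit, numeric) for one
    character, with None wherever the corresponding field is empty."""
    return (
        unicodedata.decimal(char, None),  # field 6
        unicodedata.digit(char, None),    # field 7
        unicodedata.numeric(char, None),  # field 8
    )


for char in ("\u0660", "\u00b2", "\u2155", "A"):
    name = unicodedata.name(char)
    print(f"U+{ord(char):04X} {name}: {numeric_fields(char)}")
```

U+0660 has all three values, U+00B2 SUPERSCRIPT TWO has a digit and a numeric value but no decimal value, U+2155 has only a numeric value, and "A" has none.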

This is very close to the actual documentation. Yet the documentation is 
misleading in saying "This category includes digit characters" in the "isdecimal" 
documentation.

Possible rewriting:

isdecimal: Return true if all characters in the string are decimal characters 
and there is at least one character, false otherwise. Decimal characters are 
those that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC 
DIGIT ZERO. Formally a decimal character is a character in the Unicode General 
Category "Nd".

isdigit: Return true if all characters in the string are digits and there is at 
least one character, false otherwise. Digits include decimal characters and 
digits that need special handling, such as the compatibility superscript 
digits. This covers digits which do not form decimal radix forms. Formally, a 
digit is a character that has the property value Numeric_Type=Digit or 
Numeric_Type=Decimal.
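
A small check of how the three string methods nest, using the characters mentioned in this message. This is only a sketch to illustrate the proposed wording; the helper name `classify` is an assumption, not part of any API.

```python
def classify(char):
    """Hypothetical helper: return the results of the three predicates
    as an (isdecimal, isdigit, isnumeric) tuple."""
    return (char.isdecimal(), char.isdigit(), char.isnumeric())


# ARABIC-INDIC DIGIT ZERO: usable in decimal-radix numbers.
print(classify("\u0660"))
# SUPERSCRIPT TWO: a digit, but not a decimal character.
print(classify("\u00b2"))
# VULGAR FRACTION ONE FIFTH: numeric only.
print(classify("\u2155"))
```

Every decimal character is also a digit, and every digit is also numeric, but not the other way around.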

I don't think we can refactor more than this without also rewriting the 
documentation for isnumeric, which mentions the Unicode standard the same way.

----------
nosy: +sizeof

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26483>
_______________________________________