Julien added the comment:
To dig further: the DIGIT_MASK and DECIMAL_MASK used in `unicodeobject.c` come
from `unicodectype.c`, and they match the values in `unicodetype_db.h`, which is
generated by `Tools/unicode/makeunicodedata.py`. That script builds those masks
this way:
    # decimal digit, integer digit
    decimal = 0
    if record[6]:
        flags |= DECIMAL_MASK
        decimal = int(record[6])
    digit = 0
    if record[7]:
        flags |= DIGIT_MASK
        digit = int(record[7])
    if record[8]:
        flags |= NUMERIC_MASK
        numeric.setdefault(record[8], []).append(char)
Those "record"s are documented in
ftp://unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html in which fields 6,
7, and 8 are:
- 6, Decimal digit value (N): This is a numeric field. If the character has
  the decimal digit property, as specified in Chapter 4 of the Unicode
  Standard, the value of that digit is represented with an integer value in
  this field.
- 7, Digit value (N): This is a numeric field. If the character represents a
  digit, not necessarily a decimal digit, the value is here. This covers
  digits which do not form decimal radix forms, such as the compatibility
  superscript digits.
- 8, Numeric value (N): This is a numeric field. If the character has the
  numeric property, as specified in Chapter 4 of the Unicode Standard, the
  value of that character is represented with an integer or rational number
  in this field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR
  FRACTION ONE FIFTH. Also included are numerical values for compatibility
  characters such as circled numbers.
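As a rough illustration of how fields 6, 7, and 8 turn into flags, here is a
minimal sketch that applies the same tests to three lines in UnicodeData.txt
format. The mask values are placeholders chosen for the example (the real
constants live in `Tools/unicode/makeunicodedata.py`), `number_flags` is a
hypothetical helper, and the sample lines are written out from memory of
UnicodeData.txt, so treat them as illustrative:

```python
# Placeholder bit values for illustration only; the real constants are
# defined in Tools/unicode/makeunicodedata.py and may differ.
DECIMAL_MASK = 0x02
DIGIT_MASK = 0x04
NUMERIC_MASK = 0x800

# Sample lines in UnicodeData.txt format (semicolon-separated fields).
SAMPLE_LINES = [
    "0660;ARABIC-INDIC DIGIT ZERO;Nd;0;AN;;0;0;0;N;;;;;",
    "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;",
]


def number_flags(line):
    """Hypothetical helper: set one flag per non-empty numeric field."""
    record = line.split(";")
    flags = 0
    if record[6]:  # field 6: decimal digit value
        flags |= DECIMAL_MASK
    if record[7]:  # field 7: digit value
        flags |= DIGIT_MASK
    if record[8]:  # field 8: numeric value
        flags |= NUMERIC_MASK
    return flags


for line in SAMPLE_LINES:
    record = line.split(";")
    flags = number_flags(line)
    names = [
        name
        for name, mask in (
            ("DECIMAL", DECIMAL_MASK),
            ("DIGIT", DIGIT_MASK),
            ("NUMERIC", NUMERIC_MASK),
        )
        if flags & mask
    ]
    print(f"U+{record[0]} {record[1]}: {', '.join(names)}")
```

Under these assumptions, a character with a decimal value always gets all
three flags, a superscript digit gets DIGIT and NUMERIC only, and a fraction
gets NUMERIC only.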
This is very close to the current documentation. Yet the documentation is
misleading in using "This category includes digit characters" in the
"isdecimal" documentation.
Possible rewriting:
isdecimal: Return true if all characters in the string are decimal characters
and there is at least one character, false otherwise. Decimal characters are
those that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC
DIGIT ZERO. Formally a decimal character is a character in the Unicode General
Category "Nd".
isdigit: Return true if all characters in the string are digits and there is at
least one character, false otherwise. Digits include decimal characters and
digits that need special handling, such as the compatibility superscript
digits. This covers digits which do not form decimal radix forms. Formally, a
digit is a character that has the property value Numeric_Type=Digit or
Numeric_Type=Decimal.
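To check the proposed wording against the interpreter, a small sketch along
these lines can print the relevant properties next to the results of the
string methods. `number_properties` is a hypothetical helper name, and the
four sample characters are just examples picked to cover each case:

```python
import unicodedata


def number_properties(ch):
    """Hypothetical helper: collect the numeric properties of one character."""
    return {
        "category": unicodedata.category(ch),
        # The second argument is the default returned when the character
        # has no value for that property (instead of raising ValueError).
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
        "isdecimal": ch.isdecimal(),
        "isdigit": ch.isdigit(),
        "isnumeric": ch.isnumeric(),
    }


# DIGIT FIVE, ARABIC-INDIC DIGIT ZERO, SUPERSCRIPT TWO, VULGAR FRACTION ONE FIFTH
for ch in ("5", "\u0660", "\u00b2", "\u2155"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), number_properties(ch))
```

With these samples, U+0660 is in category "Nd" and passes all three tests,
U+00B2 passes isdigit and isnumeric but not isdecimal, and U+2155 passes
isnumeric only.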
I don't think we can refactor more than this without rewriting documentation
for isnumeric which mentions the Unicode standard the same way.
----------
nosy: +sizeof
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26483>
_______________________________________