Julien added the comment: To dig further, the DIGIT_MASK and DECIMAL_MASK used in `unicodeobject.c` come from `unicodectype.c`, and they match the values in `unicodetype_db.h`, which is generated by `Tools/unicode/makeunicodedata.py`. That script builds those masks this way:

    # decimal digit, integer digit
    decimal = 0
    if record[6]:
        flags |= DECIMAL_MASK
        decimal = int(record[6])
    digit = 0
    if record[7]:
        flags |= DIGIT_MASK
        digit = int(record[7])
    if record[8]:
        flags |= NUMERIC_MASK
        numeric.setdefault(record[8], []).append(char)

Those "record"s are documented in ftp://unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html, in which fields 6, 7, and 8 are:

- 6 Decimal digit value N
  This is a numeric field. If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field.
- 7 Digit value N
  This is a numeric field. If the character represents a digit, not necessarily a decimal digit, the value is here. This covers digits which do not form decimal radix forms, such as the compatibility superscript digits.
- 8 Numeric value N
  This is a numeric field. If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with an integer or rational number in this field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values for compatibility characters such as circled numbers.

This is very close to the current documentation. Yet the documentation is misleading in using "This category includes digit characters" in the "isdecimal" documentation.

Possible rewriting:

isdecimal: Return true if all characters in the string are decimal characters and there is at least one character, false otherwise. Decimal characters are those that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO. Formally, a decimal character is a character in the Unicode General Category "Nd".

isdigit: Return true if all characters in the string are digits and there is at least one character, false otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits.
This covers digits which do not form decimal radix forms. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.

I don't think we can refactor more than this without also rewriting the documentation for isnumeric, which refers to the Unicode standard in the same way.

----------
nosy: +sizeof

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26483>
_______________________________________
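
The distinction between the three fields can be checked from Python itself. The following is a minimal sketch, assuming that the `unicodedata` accessors `decimal()`, `digit()` and `numeric()` reflect fields 6, 7 and 8 respectively; the helper name `numeric_fields` and the `samples` dictionary are hypothetical and chosen only for illustration.

```python
import unicodedata


def numeric_fields(ch):
    """Return (decimal, digit, numeric) for one character.

    Each entry is None when the character has no value for that field.
    Hypothetical helper, not part of CPython.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# One example character per case discussed in the message.
samples = {
    "ARABIC-INDIC DIGIT ZERO": "\u0660",    # decimal, digit and numeric
    "SUPERSCRIPT TWO": "\u00b2",            # digit and numeric only
    "VULGAR FRACTION ONE FIFTH": "\u2155",  # numeric only
}

for name, ch in samples.items():
    print(name, numeric_fields(ch), ch.isdecimal(), ch.isdigit(), ch.isnumeric())
```

With these samples, only the Arabic-Indic digit should satisfy `isdecimal()`, the superscript two should satisfy `isdigit()` but not `isdecimal()`, and the vulgar fraction should satisfy only `isnumeric()`, which is the nesting the proposed wording tries to describe.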