Serhiy Storchaka <storchaka+cpyt...@gmail.com> added the comment:

This issue and issue12486 have nothing in common except that both are related to the tokenize module.
There are two bugs: an overly narrow definition of \w in the re module (see issue12731 and issue1693050) and an overly narrow definition of Name in the tokenize module.

>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata
>>> for c in regex.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'·' U+00B7 MIDDLE DOT
'·' U+0387 GREEK ANO TELEIA
'፩' U+1369 ETHIOPIC DIGIT ONE
'፪' U+136A ETHIOPIC DIGIT TWO
'፫' U+136B ETHIOPIC DIGIT THREE
'፬' U+136C ETHIOPIC DIGIT FOUR
'፭' U+136D ETHIOPIC DIGIT FIVE
'፮' U+136E ETHIOPIC DIGIT SIX
'፯' U+136F ETHIOPIC DIGIT SEVEN
'፰' U+1370 ETHIOPIC DIGIT EIGHT
'፱' U+1371 ETHIOPIC DIGIT NINE
'᧚' U+19DA NEW TAI LUE THAM DIGIT ONE
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'ᢅ' U+1885 MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ' U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'·' U+00B7 MIDDLE DOT
'̀' U+0300 COMBINING GRAVE ACCENT
'́' U+0301 COMBINING ACUTE ACCENT
'̂' U+0302 COMBINING CIRCUMFLEX ACCENT
'̃' U+0303 COMBINING TILDE
...
[total 2177 characters]

The second bug can be solved by adding 14 more characters to the pattern for Name:

Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'

or

Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'

But first the issue with \w should be resolved (if we don't want to add 2177 characters).
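To make the difference between the two Name patterns above concrete, here is a small sketch that checks both against a few sample strings and compares the result with str.isidentifier(). The names NAME_LOOSE, NAME_STRICT and matches() are made up for this illustration and are not part of tokenize. The outcome also depends on how the running Python's re module defines \w, so treat it as a rough check rather than a proof.

```python
import re

# The two candidate patterns from the comment, copied verbatim.
# The constant names are hypothetical, chosen only for this sketch.
NAME_LOOSE = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'
NAME_STRICT = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'


def matches(pattern, text):
    """Return True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


# 'a' + MIDDLE DOT, SCRIPT CAPITAL P + 'x', MIDDLE DOT + 'a',
# and 'x' + ETHIOPIC DIGIT ONE.
samples = ['a\u00b7', '\u2118x', '\u00b7a', 'x\u1369']

for s in samples:
    print('%r isidentifier=%s loose=%s strict=%s' % (
        s, s.isidentifier(),
        matches(NAME_LOOSE, s), matches(NAME_STRICT, s)))
```

The first pattern accepts a string that begins with a continuation-only character such as U+00B7, which str.isidentifier() rejects; the second pattern keeps those characters out of the first position. Neither pattern rejects a leading digit, because \w itself matches digits.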
The other solution is implementing property support in re (issue12734).

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue32987>
_______________________________________