Terry J. Reedy <tjre...@udel.edu> added the comment:

I think the issues are slightly different.  #12486 is about the awkwardness of 
the API.  This is about a false error after jumping through the hoops, which I 
think Steve B did correctly.

Following the link, the Other_ID_Continue chars are

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT 
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE

# Total code points: 12

The 2 Po chars fail; the 2 No chars work.  After looking at the tokenize 
module, I believe the problem is that the re for Name is r'\w+' and the Po 
chars are not seen as \w word characters.

>>> import re
>>> r = re.compile(r'\w+', re.U)
>>> r.match('ab\u0387cd')
<re.Match object; span=(0, 2), match='ab'>

I don't know if the bug is a too-narrow definition of \w in the re module 
("most characters that can be part of a word in any language, as well as 
numbers and the underscore") or of Name in the tokenize module.

Before patching anything, I would like to know if the 2 Po Other chars are the 
only 2 not matched by \w.  Unless someone has done so already, at least a 
sample of chars from each category included in the definition of 'identifier' 
should be tested.
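
One way to run that check is sketched below.  It assumes that 
str.isidentifier() is an acceptable stand-in for the language definition of 
'identifier', and that prefixing each char with 'a' is a fair test of the 
"continue" position.  The function name is made up for illustration; 
nothing here is taken from tokenize itself.

```python
import re
import sys
import unicodedata
from collections import Counter

# The same pattern the comment above attributes to tokenize's Name.
WORD = re.compile(r'\w+')


def identifier_chars_missed_by_w(start=0, stop=sys.maxunicode + 1):
    """Return code points that may follow 'a' in an identifier
    (according to str.isidentifier) but are not matched by \\w."""
    missed = []
    for cp in range(start, stop):
        candidate = 'a' + chr(cp)
        if candidate.isidentifier() and WORD.fullmatch(candidate) is None:
            missed.append(cp)
    return missed


if __name__ == '__main__':
    missed = identifier_chars_missed_by_w()
    # Group the mismatches by general category (Po, Mn, ...) for a summary.
    by_category = Counter(unicodedata.category(chr(cp)) for cp in missed)
    print(len(missed), 'code points are valid in identifiers but not \\w')
    for category, count in sorted(by_category.items()):
        print(category, count)
```

The per-category summary should show whether the mismatch is limited to the 
2 Po chars or extends to other categories.  The result depends on the 
Unicode database shipped with the Python version used to run it.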

nosy: +serhiy.storchaka

Python tracker <rep...@bugs.python.org>
Python-bugs-list mailing list
