[issue32987] tokenize.py parses unicode identifiers incorrectly

Terry J. Reedy Tue, 13 Mar 2018 18:10:25 -0700

Terry J. Reedy <tjre...@udel.edu> added the comment:

I think the issues are slightly different.  #12486 is about the awkwardness of 
the API.  This is about a false error after jumping through the hoops, which I 
think Steve B did correctly.


Following the link, the Other_ID_Continue chars are

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE

# Total code points: 12

The 2 Po chars fail; the 2 No entries work.  After looking at the tokenize 
module, I believe the problem is that the re for Name is r'\w+' and the Po 
chars are not seen as \w word characters.

>>> import re
>>> r = re.compile(r'\w+', re.U)
>>> re.match(r, 'ab\u0387cd')
<re.Match object; span=(0, 2), match='ab'>
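
The same check can be run over all 12 Other_ID_Continue code points listed 
above.  This is only a sketch: the helper name check() is made up for 
illustration, and it uses str.isidentifier() as a stand-in for the language 
definition of an identifier rather than the tokenize module itself.

```python
import re
import unicodedata

# The 12 Other_ID_Continue code points from the list above.
OTHER_ID_CONTINUE = [0x00B7, 0x0387, *range(0x1369, 0x1372), 0x19DA]

word = re.compile(r'\w')

def check(cp):
    """Return (category, valid inside an identifier?, matched by \\w?)."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        ('a' + ch).isidentifier(),      # ch used as a non-initial character
        word.fullmatch(ch) is not None,
    )

for cp in OTHER_ID_CONTINUE:
    cat, ok_ident, ok_word = check(cp)
    print(f'U+{cp:04X} {cat} identifier={ok_ident} \\w={ok_word}')
```

A row with identifier=True and \w=False is a character that the r'\w+' 
pattern would cut an identifier short at.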

I don't know if the bug is a too narrow definition of \w in the re module 
("most characters that can be part of a word in any language, as well as 
numbers and the underscore") or of Name in the tokenize module.

Before patching anything, I would like to know if the 2 Po Other chars are the 
only 2 not matched by \w.  Unless someone has done so already, at least a 
sample of chars from each category included in the definition of 'identifier' 
should be tested.
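
One possible way to run that test is a brute-force scan of every code point, 
grouped by general category.  This is a rough sketch, not a proposed patch: 
the function name is invented here, and str.isidentifier() is assumed to be a 
close enough proxy for 'identifier' (it does not apply the NFKC normalization 
that the compiler does).

```python
import re
import sys
import unicodedata
from collections import Counter

word = re.compile(r'\w')

def unmatched_identifier_chars():
    """Yield chars allowed after the first char of an identifier
    that a single \\w does not match."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ('a' + ch).isidentifier() and not word.fullmatch(ch):
            yield ch

missed = list(unmatched_identifier_chars())

# Count the misses per general category (Po, Mn, ...).
by_category = Counter(unicodedata.category(ch) for ch in missed)
for cat, n in sorted(by_category.items()):
    print(cat, n)
```

If the 2 Po chars were the only problem, the output would be a single line 
for Po; any other category that shows up would need covering as well.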

----------
nosy: +serhiy.storchaka

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue32987>
_______________________________________