[issue24194] tokenize fails on some Other_ID_Start or Other_ID_Continue

Terry J. Reedy Sun, 16 Jan 2022 15:11:12 -0800

Terry J. Reedy <[email protected]> added the comment:

Updated doc link, which appears to be the same:
https://docs.python.org/3.11/reference/lexical_analysis.html#identifiers


Updated property list linked from the above:
https://www.unicode.org/Public/14.0.0/ucd/PropList.txt

Relevant content for this issue:

1885..1886    ; Other_ID_Start # Mn   [2] MONGOLIAN LETTER ALI GALI BALUDA..MONGOLIAN LETTER ALI GALI THREE BALUDA

2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
# Total code points: 6

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
# Total code points: 12

Codepoints of the opening example '℘·':
'0x2118' Other_ID_Start     Sm  SCRIPT CAPITAL P
'0xb7'   Other_ID_Continue  Po  MIDDLE DOT
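
The two rows above can be reproduced with the standard unicodedata module.
This is only a minimal sketch: the helper name describe() is made up for
illustration, and the categories and names reported depend on the Unicode
database bundled with the running interpreter.

```python
import unicodedata


def describe(text):
    """Return (hex code point, general category, name) for each character."""
    return [
        (hex(ord(ch)), unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


rows = describe("℘·")
for row in rows:
    print(*row)

# str.isidentifier() reports whether the string is a valid identifier
# according to the language definition linked above.
accepted = "℘·".isidentifier()
print("isidentifier:", accepted)
```

Neither character is in a letter or digit category (Sm and Po), which is why
a pattern built only on \w does not treat '℘·' as a name.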

Except for the two Mongolian start characters, Meador's patch hardcodes the 
'Other' characters, thereby adding them without waiting for the re module to 
be fixed.  While this will miss new additions without manual updates, it is 
better than missing everything for however many years.  I will make a PR with 
the additions and look at the new tests.
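
To illustrate what hardcoding means here, a rough sketch follows.  It is not
Meador's patch and not the code in tokenize; the names OTHER_ID_START,
OTHER_ID_CONTINUE and NAME are hypothetical, and the code points are simply
copied from the PropList.txt excerpt above, leaving out U+1885..U+1886.

```python
import re

# Code points copied from the PropList.txt excerpt (assumption: this list
# would have to be updated by hand when Unicode adds new entries).
OTHER_ID_START = "\u2118\u212E\u309B\u309C"
OTHER_ID_CONTINUE = "\u00B7\u0387\u1369-\u1371\u19DA"

# Hypothetical name pattern: \w extended with the hardcoded characters.
# Like a bare \w+, it does not by itself reject a leading digit.
NAME = re.compile(
    rf"[\w{OTHER_ID_START}][\w{OTHER_ID_START}{OTHER_ID_CONTINUE}]*"
)

plain = re.fullmatch(r"\w+", "℘·")   # \w alone does not cover Sm or Po
extended = NAME.fullmatch("℘·")      # the hardcoded classes do

print("plain \\w+ matches:", plain is not None)
print("extended pattern matches:", extended is not None)
```

The trade-off is the one described above: the extra characters are picked up
immediately, at the cost of maintaining the two strings manually.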

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue24194>
_______________________________________