Tal Einat added the comment:

It seems that the unicodedata module already supplies relevant functions which 
can be used for this. For example, we can replace "char in 
self._id_first_chars" with something like:

from unicodedata import normalize, category
norm_char = normalize("NFKC", char)[0]
is_id_first_char = norm_char == '_' or category(norm_char) in {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

I'm not sure what the "Other_ID_Start property" mentioned in [1] and [2] means, 
though. Can we get someone with more in-depth knowledge of Unicode to help with 
this?
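
As a rough illustration of why the category check alone may fall short: 
U+2118 and U+212E are listed under Other_ID_Start in Unicode's PropList.txt, 
and their general categories are outside the set above. The sketch below 
compares the category test with str.isidentifier(), which follows the 
language's identifier definition. The names ID_START_CATEGORIES and 
category_says_id_start are made up for this example, and this is only a 
cross-check, not a proposed patch.

```python
from unicodedata import category, normalize

# General categories accepted for the first character of an identifier
# (hypothetical name, taken from the set discussed above).
ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})


def category_says_id_start(char):
    """Category-based test only; ignores Other_ID_Start."""
    norm_char = normalize("NFKC", char)[0]
    return norm_char == "_" or category(norm_char) in ID_START_CATEGORIES


# Two code points carrying the Other_ID_Start property.
for char in ("\u2118", "\u212e"):
    # A one-character string is a valid identifier only if that
    # character may start an identifier.
    print(hex(ord(char)), category(char),
          category_says_id_start(char), char.isidentifier())
```

If the two answers disagree for these characters, the category test would 
need an explicit exception list for Other_ID_Start.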

The real question is how to do this *fast*, since HyperParser does a *lot* of 
these checks. Do you think caching would be a good approach?
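
A minimal sketch of the caching idea, assuming functools.lru_cache is 
acceptable here. The function name is_id_first_char and the constant 
_ID_START_CATEGORIES are hypothetical, not existing HyperParser code, and 
the check deliberately ignores Other_ID_Start:

```python
from functools import lru_cache
from unicodedata import category, normalize

# Hypothetical constant: categories allowed for an identifier's first char.
_ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})


@lru_cache(maxsize=None)
def is_id_first_char(char):
    """Return True if char may start an identifier (category test only)."""
    if char == "_":
        return True
    # Identifiers are compared in NFKC form, so classify the normalized char.
    norm_char = normalize("NFKC", char)[0]
    return norm_char == "_" or category(norm_char) in _ID_START_CATEGORIES


# Repeated lookups of the same character are served from the cache.
for ch in "spam_1 = spam_2":
    is_id_first_char(ch)
print(is_id_first_char.cache_info())
```

Since source code tends to reuse a small set of characters, the cache should 
stay small, but whether this is fast enough would need to be measured.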

See:
.. [1]: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
.. [2]: http://legacy.python.org/dev/peps/pep-3131/

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21765>
_______________________________________