Tal Einat added the comment:

Indeed, I seem to have been misinterpreting the grammar, despite taking care 
and reading it several times. This strengthens my opinion that we should use 
str.isidentifier() rather than attempt to correctly re-implement just the parts 
that we need.

Attached is a patch which fixes HyperParser._eat_identifier(), to the extent of 
my testing (tests included).

When non-ASCII characters are encountered, this patch uses Terry's suggestion 
of checking for valid identifier characters using ('a' + 
string_part).isidentifier(). It also employs his suggestion of how to avoid 
executing this check at every index, by skipping 4 characters at a time.
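
For illustration, the approach described above can be sketched roughly as follows. This is a simplified sketch and not the code in the attached patch: the function name eat_identifier_sketch and its exact stepping strategy are assumptions made for this example, and it ignores keywords and the ASCII fast path.

```python
def eat_identifier_sketch(text, limit, pos):
    """Return the length of the identifier ending at text[pos] (exclusive).

    Scans backward from pos, never going before limit. Hypothetical
    helper for illustration only; not the patched HyperParser code.
    """
    i = pos
    # Step back up to 4 characters at a time. Prefixing 'a' lets
    # str.isidentifier() test "are these all valid identifier
    # characters?" without caring whether the chunk could *start* one.
    while i > limit:
        step = min(4, i - limit)
        if ('a' + text[i - step:i]).isidentifier():
            i -= step
        else:
            break
    # Finish one character at a time to find the exact boundary.
    while i > limit and ('a' + text[i - 1:i]).isidentifier():
        i -= 1
    # The candidate must itself be a valid identifier (for example,
    # it must not start with a digit); otherwise eat nothing.
    if not text[i:pos].isidentifier():
        return 0
    return pos - i


source = "x = naïve_var"
print(eat_identifier_sketch(source, 0, len(source)))
```

Checking a chunk of characters per call keeps the number of isidentifier() calls low for long identifiers, while the single-character loop recovers the precise start.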

However, even with this fix, HyperParser.get_expression() still fails with 
non-ASCII Unicode strings. This is because it uses PyParse, which doesn't 
support Unicode! For example, it apparently replaces all non-ASCII characters 
with 'x'. I've added (in this patch) a few tests for this, which currently fail.
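
To make the effect concrete, here is a minimal sketch of that kind of replacement. It only mimics the behavior described above; mask_non_ascii is a hypothetical name and this is not PyParse's actual code.

```python
def mask_non_ascii(source):
    """Replace every non-ASCII character with 'x', keeping the length.

    Hypothetical helper mimicking the behavior described above;
    not the actual PyParse implementation.
    """
    return ''.join(ch if ord(ch) < 128 else 'x' for ch in source)


original = "naïve = 1"
masked = mask_non_ascii(original)
# Indices still line up, but the identifier text no longer matches.
print(masked, len(masked) == len(original))
```

The structure of the code survives, since character positions are preserved, but any identifier text read back from the masked string differs from the original.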

FWIW, PyParse includes a comment to this effect[1]:

<quote>
The parse functions have no idea what to do with Unicode, so
replace all Unicode characters with "x".  This is "safe"
so long as the only characters germane to parsing the structure
of Python are 7-bit ASCII.  It's *necessary* because Unicode
strings don't have a .translate() method that supports
deletechars.
</quote>

Fully resolving this issue will apparently require fixing PyParse to 
support Unicode properly.

.. [1]: 
http://hg.python.org/cpython/file/d25ae22cc992/Lib/idlelib/PyParse.py#l117

----------
keywords: +patch
Added file: 
http://bugs.python.org/file35876/taleinat.20140706.IDLE_HyperParser_unicode_ids.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21765>
_______________________________________