Jim Jewett writes:

 > On 5/22/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
 >
 > > That's why Java and C++ use \u, so you would write L\u00F6wis
 > > as an identifier. ...
 >
 > > I think you are really arguing for \u escapes in identifiers here.
 >
 > Yes, that is effectively what I was suggesting.
 >
 > > *This* is truly unambiguous. I claim that it is also useless.
 >
 > It means users could see the usability benefits of PEP 3131, but the
 > python internals could still work with ASCII only.
But this reasoning is not coherent.  Python internals will have no
problem with non-ASCII; in fact, they would have no problem with
tokens containing Cf characters or even reserved code points.  Just
give an unambiguous grammar for tokens composed of code points.  It's
only when a human enters the loop (i.e., presentation of the
identifier on an output stream) that such identifiers cause problems.
It's *users* who are at risk, not the Python translator, and if there
are any usability benefits to be had from *presenting* identifiers
that don't stick to ASCII, the risks of confusing or deliberately
obfuscated code inhere in that very presentation, not in the
internals.  For example:

 > It simplifies checking for identifiers that *don't* stick to ASCII,

Only if you assume that people will actually perceive the 10-character
string "L\u00F6wis" as an identifier, despite the fact that any
programmable editor can be trained, in a very small amount of code, to
display it as the 5-character string "Löwis".  Conversely, any
programmable editor can just as easily be trained to take the internal
representation "Löwis" and display it as "L\u00F6wis", giving all the
benefits of the representation you propose.  But who would ever enable
that?  (I suppose this is what Martin means by "useless".)  A sketch
of such a round-trip conversion is appended at the end of this
message.

 > which reduces some of the concerns about confusable characters, and
 > which ones to allow.

For the reasons given above, it reduces no concerns at all, except to
the extent that it makes the use of human-readable identifiers as
Python identifiers inconvenient.

I conclude that IMO PEP 3131 is precisely correct in scope as far as
it goes.  The only issues PEP 3131 should be concerned with *defining*
are those that cause problems with canonicalization, and the range of
characters and languages allowed in the standard library.

I also propose that it would be useful to provide a standard mechanism
for auditing the input stream (sketches are appended below).  There
would be one implementation for the stdlib that complains[1] about
non-ASCII characters and possibly non-English words, and IMO that
should be the default (for the reasons Ka-Ping gives for opposing the
whole PEP).  A second would provide a very conservative Unicode set,
with provision for amendment as experience shows restriction to be
desirable or extension to be safe.  A third, allowing any character
that can be canonicalized into the form that PEP 3131 allows
internally, is left as an exercise for the reader wild 'n' crazy
enough to want to use it.

For user convenience, it would be nice if these were implemented using
the codec interface, although if applied to raw input there would need
to be some duplication of parsing logic (specifically, comments and
strings would have to be passed through unchecked).  I suppose it
would be too expensive to use the codec interface at the point of
interning an identifier (but maybe not, since the check only needs to
happen when an identifier is first added to the symbol table; later
occurrences would be short-circuited by probing the table and finding
the token already there).

Footnotes:
[1] I'm not sure what "complain" would mean in practice, since the PEP
    acknowledges use cases for both non-ASCII and non-English in the
    stdlib.
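For concreteness, here is a minimal sketch of the round trip an editor
could perform between the two spellings.  This is only an illustration
of the point above, not anything proposed for the PEP; the function
names are my own, and it handles only BMP characters for brevity.

    # Hypothetical editor helpers: convert between the 5-character
    # identifier "Löwis" and the 10-character ASCII spelling "L\u00F6wis".
    import re

    def to_escaped(identifier):
        # Replace every non-ASCII character with a \uXXXX escape.
        return "".join(ch if ord(ch) < 128 else "\\u%04X" % ord(ch)
                       for ch in identifier)

    def from_escaped(text):
        # Decode \uXXXX escapes back to the characters they name.
        return re.sub(r"\\u([0-9A-Fa-f]{4})",
                      lambda m: chr(int(m.group(1), 16)),
                      text)

    assert to_escaped("Löwis") == "L\\u00F6wis"
    assert from_escaped("L\\u00F6wis") == "Löwis"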
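Likewise, a rough sketch of what the second, "conservative Unicode
set" auditor might look like.  The allowed-category policy and the
names are invented for illustration only; the real set would be
whatever the PEP (or experience) settles on.

    # Hypothetical auditor: complain about any non-ASCII character that
    # falls outside a conservative set of Unicode categories.
    import unicodedata

    ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Pc"}

    def audit_identifier(name):
        """Return a list of complaints about the characters in `name`."""
        complaints = []
        for ch in name:
            if ord(ch) < 128:
                continue          # ASCII is always acceptable here
            cat = unicodedata.category(ch)
            if cat not in ALLOWED_CATEGORIES:
                complaints.append("U+%04X (%s) has disallowed category %s"
                                  % (ord(ch),
                                     unicodedata.name(ch, "<unnamed>"),
                                     cat))
        return complaints

The stdlib auditor would then be the degenerate case that complains
about every character with ord(ch) >= 128.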
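Finally, a sketch of the "check only at interning time" idea from the
last paragraph, reusing audit_identifier and unicodedata from the
previous sketch: the audit runs once per distinct identifier, and
later occurrences hit the table and skip it.  NFKC is the
normalization PEP 3131 adopts; everything else here is hypothetical.

    # Hypothetical interning wrapper: audit an identifier only the first
    # time it is added to the symbol table.
    _symbol_table = {}

    def intern_identifier(name, audit=audit_identifier):
        canonical = unicodedata.normalize("NFKC", name)
        if canonical not in _symbol_table:
            problems = audit(canonical)
            if problems:
                raise SyntaxError("; ".join(problems))
            _symbol_table[canonical] = canonical
        return _symbol_table[canonical]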