> 2. Python forbids these characters.
>
> Martin, JavaScript treats these specially, and I think Python probably
> should, too:
>
> The ECMAScript 3 standard for JavaScript requires the
> tokenizer to throw away all Unicode format-control characters
> (general category Cf).
>
> ECMAScript 4 will likely tweak this (an incompatible change)
> to retain those characters only in strings and regexps.
> I like that better.

I've added this as an open issue. It would be easy to add, but I would
like to get some confirmation first that it actually helps writers of
the RTL languages (preferably from some native speakers).

The proposed change would be that Cf characters would be allowed *only*
in and immediately around identifiers, and in string literals and
comments, i.e. the scanner would work this way:

- perform token classification based only on individual ASCII letters;
  classify all non-ASCII letters as potential identifiers.
- for potential identifiers (i.e. runs of non-ASCII characters and
  ASCII letters, digits, and underscore), drop Cf characters, then
  verify identifier syntax.

IOW, you couldn't put the formatting characters around whitespace,
keywords, or punctuation.

An alternative implementation would be to drop formatting characters
everywhere except in string literals.

I'll repeat that UTR#39 explicitly discourages support for formatting
characters in identifiers.

Regards,
Martin
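
For illustration, the second scanner step described above (drop Cf characters from a potential identifier, then verify identifier syntax) could be sketched in Python roughly as follows. This is only a sketch under assumptions, not the actual tokenizer change: the function names are hypothetical, `unicodedata.category` is used to detect general category Cf, and `str.isidentifier` stands in for "verify identifier syntax" (it does not reject keywords).

```python
import unicodedata


def strip_format_controls(run):
    """Return `run` with all general-category-Cf characters removed."""
    return "".join(ch for ch in run if unicodedata.category(ch) != "Cf")


def check_potential_identifier(run):
    """Drop Cf characters from a potential identifier, then verify it.

    `run` is assumed to be a run of non-ASCII characters and ASCII
    letters, digits, and underscore, as already found by the scanner.
    Raises SyntaxError if what remains is not a valid identifier.
    """
    name = strip_format_controls(run)
    # An empty result means the run held only formatting characters,
    # e.g. a stray mark next to whitespace or punctuation.
    if not name.isidentifier():
        raise SyntaxError("invalid identifier: %r" % run)
    return name
```

Under this sketch, a formatting character that is not attached to an identifier forms a run of its own, becomes empty once the Cf characters are dropped, and is rejected, which matches the rule that such characters may not appear around whitespace or punctuation.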