ptmcg@austin.rr.com wrote:

> ... add a cautionary section on homoglyphs, specifically citing
> “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA)
> as an example problem pair.

There is a Unicode technical report about confusables, but it is never
clear where to stop. Are I (upper case I), l (lower case l) and 1
(numeric 1) from ASCII already a problem? And if we do it at all, is
there any way to avoid making Cyrillic languages second-class?

I couldn't quickly find the current report, but these should be
helpful if you want to go deeper:

http://www.unicode.org/reports/tr36/
http://unicode.org/reports/tr36/confusables.txt
https://util.unicode.org/UnicodeJsps/confusables.jsp

> I wanted to look a little further at the use of characters in identifiers
> beyond the standard 7-bit ASCII, and so I found some of these same
> issues dealing with Unicode NFKC normalization. The first discovery was
> the overlapping normalization of “ªº” with “ao”.

Here I don't see the problem. Things that look slightly different are
really the same, and you can write it either way. So you can use what
looks like a funny font, but the closest it comes to a security risk is
that maybe you could access something without a casual reader realizing
that you are doing so. They would know that you *could* access it, just
not that you *did*.

> Some other discoveries:
> “·” (ASCII 183) is a valid identifier body character, making “_···” a valid
> Python identifier.

That and the apostrophe are Unicode consortium regrets, because they
are normally punctuation, but there are also languages that use them as
letters.

The apostrophe is (supposedly only) used by Afrikaans. I asked a native
speaker where and how often it was used, and the similarity to Dutch
was enough that Guido felt comfortable excluding it. (It *may* have
been similar to using the apostrophe for a contraction in English, and
saying it therefore represents a letter, but the scope was clearly
smaller.)

But the dot is used in Catalan, and ... we didn't find anyone ready to
say it wouldn't be needed for sensible identifiers. It is worth listing
as a warning, and linters should probably complain.

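For anyone who wants to poke at the cases discussed so far, here is a
minimal sketch using only the standard library. The helper names (nfkc,
bound_names) and the constants are made up for illustration; the only
thing assumed about the interpreter is that it converts identifiers to
NFKC while parsing. It contrasts the homoglyph pair (which NFKC does
*not* merge, so you get two variables) with “ªº” (which folds to “ao”,
so you get one), and checks the middle-dot identifier.

```python
import unicodedata

LATIN_A = "\u0041"            # "A", LATIN CAPITAL LETTER A
GREEK_ALPHA = "\u0391"        # "Α", GREEK CAPITAL LETTER ALPHA
ORDINALS = "\u00aa\u00ba"     # "ªº", feminine and masculine ordinal indicators
DOTTED = "_" + "\u00b7" * 3   # "_···", underscore plus three U+00B7 MIDDLE DOT


def nfkc(text):
    """Return the NFKC form, the normalization applied to identifiers."""
    return unicodedata.normalize("NFKC", text)


def bound_names(source):
    """Run source in a fresh namespace and return the names it binds."""
    namespace = {}
    exec(source, namespace)
    # exec() adds __builtins__ on its own; it is not one of our names.
    return {name for name in namespace if name != "__builtins__"}


# Homoglyphs: NFKC leaves both letters alone, so these are two variables.
homoglyph_names = bound_names(f"{LATIN_A} = 1\n{GREEK_ALPHA} = 2\n")

# Compatibility characters: "ªº" folds to "ao", so this is one variable.
ordinal_names = bound_names(f"{ORDINALS} = 1\nao = 2\n")

print(sorted(homoglyph_names), sorted(ordinal_names), DOTTED.isidentifier())
```

The first case is the one a reader can be fooled by: two names that
look identical but are distinct. The second is the opposite: two
spellings that look different but reach the same name.
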
> “_” seems to be a special case for normalization. Only the ASCII “_”
> character is valid as a leading identifier character; the Unicode
> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”)
> can only be used as identifier body characters. “︳” especially could be
> misread as “|” followed by a space, when it actually normalizes to “_”.

So go ahead and warn, but it isn't clear how that could be abused to
look like something other than a syntax error, except maybe through
soft keywords. (Ha! I snuck in a call to async︳def that had been
imported with *, and you didn't worry about the import *, or the
apparently wild cursor position marker, or the strange async definition
that was never used! No way I could have just issued a call to _flush
and done the same thing!)

> Potential beneficial uses:
> I am considering taking my transformer code and experimenting with an
> orthogonal approach to syntax highlighting, using Unicode groups
> instead of colors. Module names using characters from one group,
> builtins from another, program variables from another, maybe
> distinguish local from global variables. Colorizing has always been an
> obvious syntax highlight feature, but is an accessibility issue for those
> with difficulty distinguishing colors.

I kind of like the idea, but ... if you're doing it on-the-fly in the
editor, you could just use different fonts. If you're actually saving
those changes, it seems likely to lead to a lot of spurious diffs if
anyone uses a different editor.

-jJ

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NPTL43EVT2FF76LXIBBWVHDU6NXH3HF5/