ptmcg@austin.rr.com wrote:
> ... add a cautionary section on homoglyphs, specifically citing
> “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA)
> as an example problem pair.
There is a Unicode technical report about confusables, but it is never clear
where to stop. Are I (uppercase i), l (lowercase L), and 1 (the digit one) from
ASCII already a problem? And if we do it at all, is there any way to avoid
making Cyrillic languages second-class?
I can't quickly find the current version of the report, but these should be
helpful if you want to go deeper:
http://www.unicode.org/reports/tr36/
http://unicode.org/reports/tr36/confusables.txt
https://util.unicode.org/UnicodeJsps/confusables.jsp
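To make the example pair concrete, here is a minimal sketch using only the
standard library. The helper names (describe, nfkc_equal) are mine, purely for
illustration. It shows that the two letters are distinct code points, that NFKC
does not unify them, and that Python therefore treats them as two separate
names.

```python
import unicodedata

latin_a = "A"            # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"   # GREEK CAPITAL LETTER ALPHA, renders like a Latin A


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def nfkc_equal(a, b):
    """True if the two strings are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(describe(latin_a))      # U+0041 LATIN CAPITAL LETTER A
print(describe(greek_alpha))  # U+0391 GREEK CAPITAL LETTER ALPHA
print(nfkc_equal(latin_a, greek_alpha))  # False

# Both are valid identifiers, and they bind two different names.
namespace = {}
exec("A = 1\n\u0391 = 2", namespace)
print(namespace["A"], namespace[greek_alpha])  # 1 2
```

Normalization is no help here: a check for this pair has to come from
confusables data, not from NFKC.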
> I wanted to look a little further at the use of characters in identifiers
> beyond the standard 7-bit ASCII, and so I found some of these same
> issues dealing with Unicode NFKC normalization. The first discovery was
> the overlapping normalization of “ªº” with “ao”.
Here I don't see the problem. Characters that look slightly different normalize
to the same identifier, so you can write it either way. You can use what looks
like a funny font, but the closest it comes to a security risk is that maybe
you could access something without a casual reader realizing that you are doing
so. They would know that you *could* access it, just not that you *did*.
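A small sketch of what "really the same" means in practice, assuming CPython's
documented behavior of normalizing identifiers to NFKC while parsing. The
variable names are mine.

```python
import unicodedata

# FEMININE ORDINAL INDICATOR followed by MASCULINE ORDINAL INDICATOR,
# the two characters quoted above.
ordinals = "\u00aa\u00ba"

# NFKC folds them to plain ASCII letters.
folded = unicodedata.normalize("NFKC", ordinals)
print(folded)  # ao

# Assign through the "funny" spelling, read through the ASCII spelling.
namespace = {}
exec(ordinals + " = 1\nresult = ao + 1", namespace)
print(namespace["result"])    # 2

# Only the normalized name is stored.
print("ao" in namespace)      # True
print(ordinals in namespace)  # False
```

So the two spellings are one name as far as the interpreter is concerned; the
only difference is in what a reader sees.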
> Some other discoveries:
> “·” (ASCII 183) is a valid identifier body character, making “_···” a valid
> Python identifier.
That and the apostrophe are Unicode Consortium regrets: they are normally
punctuation, but some languages use them as letters. The apostrophe is
(supposedly) used this way only by Afrikaans. I asked a native speaker where
and how often it was used, and the similarity to Dutch was enough that Guido
felt comfortable excluding it. (It *may* have been similar to using the
apostrophe for a contraction in English and saying that it therefore
represents a letter, but the scope was clearly smaller.) But the dot is used
in Catalan, and ... we didn't find anyone ready to say it wouldn't be needed
for sensible identifiers. It is worth listing as a warning, and linters should
probably complain.
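As a rough idea of what such a lint check might look like, here is a minimal
sketch. The function name suspicious_chars and the rule itself (flag any
punctuation-category character other than the ASCII underscore) are my
assumptions, not an existing linter's behavior. Note that the middle dot is
U+00B7, code point 183, which is outside ASCII.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT, general category Po


def suspicious_chars(identifier):
    """Return the punctuation-category characters in an identifier.

    The ASCII underscore is the one punctuation character that readers
    expect in a name, so it is exempt. Everything else whose general
    category starts with "P" is reported.
    """
    return [
        ch
        for ch in identifier
        if ch != "_" and unicodedata.category(ch).startswith("P")
    ]


dotted = "_" + MIDDLE_DOT * 3
catalan = "col" + MIDDLE_DOT + "lecci\u00f3"  # a legitimate Catalan word

print(dotted.isidentifier())           # True
print(MIDDLE_DOT.isidentifier())       # False: not allowed as a first character
print(len(suspicious_chars(dotted)))   # 3
print(len(suspicious_chars(catalan)))  # 1
print(suspicious_chars("plain_name"))  # []
```

The Catalan word trips the same check, which is why this is better as a
warning than as an error.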
> “_” seems to be a special case for normalization. Only the ASCII “_”
> character is valid as a leading identifier character; the Unicode
> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”)
> can only be used as identifier body characters. “︳” especially could be
> misread as “|” followed by a space, when it actually normalizes to “_”.
So go ahead and warn, but it isn't clear how that could be abused to look like
something other than a syntax error, except maybe through soft keywords. (Ha!
I snuck in a call to async︳def that had been imported with *, and you didn't
worry about the import *, or the apparently wild cursor position marker, or the
strange async definition that was never used! No way I could have just issued
a call to _flush and done the same thing!)
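For anyone who wants to reproduce the observation, here is a minimal sketch
covering five of the low-line variants quoted above (U+FE33, U+FE34, U+FE4D,
U+FE4E, U+FE4F). The helper names are mine, and the exec line assumes CPython's
NFKC normalization of identifiers.

```python
import unicodedata

# Presentation-form and decorative low lines from the quoted list.
LOW_LINE_VARIANTS = ["\ufe33", "\ufe34", "\ufe4d", "\ufe4e", "\ufe4f"]


def folds_to_underscore(ch):
    """True if NFKC maps the character to the ASCII underscore."""
    return unicodedata.normalize("NFKC", ch) == "_"


def valid_leading(ch):
    """True if the character may start an identifier."""
    return (ch + "x").isidentifier()


def valid_body(ch):
    """True if the character may appear after the first position."""
    return ("x" + ch).isidentifier()


for ch in LOW_LINE_VARIANTS:
    print(f"U+{ord(ch):04X}", folds_to_underscore(ch),
          valid_leading(ch), valid_body(ch))

# The "soft keyword" lookalike: this binds the ordinary name async_def.
namespace = {}
exec("async\ufe33def = 'bound'", namespace)
print(namespace["async_def"])  # bound
```

Each variant folds to "_" and is accepted only in the body, matching the quoted
finding; the lookalike ends up as a perfectly ordinary name.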
> Potential beneficial uses:
> I am considering taking my transformer code and experimenting with an
> orthogonal approach to syntax highlighting, using Unicode groups
> instead of colors. Module names using characters from one group,
> builtins from another, program variables from another, maybe
> distinguish local from global variables. Colorizing has always been an
> obvious syntax highlight feature, but is an accessibility issue for those
> with difficulty distinguishing colors.
I kind of like the idea, but ... if you're doing it on-the-fly in the editor,
you could just use different fonts. If you're actually saving those changes,
it seems likely to lead to a lot of spurious diffs if anyone uses a different
editor.
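To illustrate both the idea and the diff concern, here is a minimal sketch
under my own assumptions: it uses the Mathematical Bold letters (starting at
U+1D400 and U+1D41A) as the "group", and the function name to_bold is
hypothetical. I don't know which groups the proposed transformer would use.

```python
import unicodedata

BOLD_UPPER = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER = 0x1D41A  # MATHEMATICAL BOLD SMALL A


def to_bold(name):
    """Map ASCII letters to Mathematical Bold; leave other characters alone."""
    out = []
    for ch in name:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


plain = "os_path"
styled = to_bold(plain)

# The interpreter would see the same identifier after NFKC ...
print(unicodedata.normalize("NFKC", styled) == plain)  # True
print(styled.isidentifier())                           # True

# ... but the saved text differs, which is what a diff tool compares.
print(styled == plain)                                 # False
```

That last line is the spurious-diff problem: the program is unchanged, but
every restyled name shows up as a modified line.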
-jJ