[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

Jim J. Jewett Sun, 14 Nov 2021 09:45:15 -0800

ptmcg＠austin.rr.com wrote:

> ...  add a cautionary section on homoglyphs, specifically citing
> “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA)
> as an example problem pair.


There is a Unicode tech report about confusables, but it is never clear where 
to stop.  Are I (upper case I), l (lower case l) and 1 (numeric 1) from ASCII 
already a problem?  And if we do it at all, is there any way to avoid making 
Cyrillic languages second-class?
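
As a rough illustration of the quoted pair, and of one way a check might avoid singling out any particular script: the sketch below only complains when scripts are mixed inside a single identifier. Using the first word of a character's Unicode name as a stand-in for its script is an assumption made for this example, as is the helper name scripts_in; neither comes from the tech report, and the heuristic does nothing for the I/l/1 case.

```python
import unicodedata

LATIN_A = "\u0041"      # LATIN CAPITAL LETTER A
GREEK_ALPHA = "\u0391"  # GREEK CAPITAL LETTER ALPHA

# NFKC does not fold one into the other, so these are two distinct variables.
namespace = {}
exec(f"{LATIN_A} = 1\n{GREEK_ALPHA} = 2", namespace)


def scripts_in(identifier):
    """Crude script guess: first word of each letter's Unicode name.

    Hypothetical helper for illustration only.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


all_cyrillic = "\u0438\u043c\u044f"  # Cyrillic-only identifier: a single script
mixed = "p\u0430yload"               # Latin letters with one Cyrillic letter inside

print(scripts_in(all_cyrillic))
print(scripts_in(mixed))
```

Under this approach an all-Cyrillic identifier is treated exactly like an all-Latin one; only the mixed case would draw a warning.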

I'm not quickly finding the contemporary report, but these should be helpful if 
you want to go deeper:

    http://www.unicode.org/reports/tr36/
    http://unicode.org/reports/tr36/confusables.txt
    https://util.unicode.org/UnicodeJsps/confusables.jsp


> I wanted to look a little further at the use of characters in identifiers 
> beyond the standard 7-bit ASCII, and so I found some of these same 
> issues dealing with Unicode NFKC normalization. The first discovery was 
> the overlapping normalization of “ªº” with “ao”. 

Here I don't see the problem.  Things that look slightly different are really 
the same, and you can write it either way.  So you can use what looks like a 
funny font, but the closest it comes to a security risk is that maybe you could 
access something without a casual reader realizing that you are doing so.  They 
would know that you *could* access it, just not that you *did*.
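
For anyone who wants to see the equivalence directly, here is a small sketch using only the standard library. The variable names are made up for the example.

```python
import unicodedata

ordinals = "\u00aa\u00ba"  # FEMININE and MASCULINE ORDINAL INDICATOR
folded = unicodedata.normalize("NFKC", ordinals)

# Identifiers are NFKC-normalized when source is compiled, so the name
# written with ordinal indicators and the plain ASCII name are one variable.
namespace = {}
exec(f"{ordinals} = 'set via ordinals'\nresult = ao", namespace)

print(folded)
print(namespace["result"])
```

Only the normalized spelling ends up as a key in the namespace.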

> Some other discoveries:
> “·” (U+00B7, code point 183, not ASCII) is a valid identifier body character,
> making “_···” a valid Python identifier.

That and the apostrophe are Unicode Consortium regrets, because they are 
normally punctuation, but there are also languages that use them as letters. 
The apostrophe is (supposedly only) used by Afrikaans.  I asked a native 
speaker about where/how often it was used, and the similarity to Dutch was 
enough that Guido felt comfortable excluding it.  (It *may* have been similar 
to using the apostrophe for a contraction in English, and saying it therefore 
represents a letter, but the scope was clearly smaller.)  But the dot is used 
in Catalan, and ... we didn't find anyone ready to say it wouldn't be needed 
for sensible identifiers.  It is worth listing as a warning, and linters should 
probably complain.
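
A minimal sketch of what such a linter complaint might look like, assuming the rule is "flag any character in a Unicode punctuation category other than the ASCII underscore". That rule and the helper name punctuation_in_identifier are assumptions for illustration; an actual linter may well choose something different.

```python
import unicodedata


def punctuation_in_identifier(name):
    """Return characters a linter might flag in an identifier.

    Hypothetical rule: any Unicode punctuation (category P*) except ASCII '_'.
    """
    return [
        ch for ch in name
        if ch != "_" and unicodedata.category(ch).startswith("P")
    ]


dots = "_\u00b7\u00b7\u00b7"       # underscore followed by three middle dots
catalan = "col\u00b7lecci\u00f3"   # Catalan word that uses the dot as a letter

for candidate in (dots, catalan, "plain_name"):
    print(candidate.isidentifier(), punctuation_in_identifier(candidate))
```

Note that the Catalan word is flagged too, which is why this seems better suited to a warning than to an outright ban.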

> “_” seems to be a special case for normalization. Only the ASCII “_”
> character is valid as a leading identifier character; the Unicode 
> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”)
> can only be used as identifier body characters. “︳” especially could be
> misread as “|” followed by a space, when it actually normalizes to “_”.

So go ahead and warn, but it isn't clear how that could be abused to look like 
something other than a syntax error, except maybe through soft keywords.  (Ha!  
I snuck in a call to async︳def that had been imported with *, and you didn't 
worry about the import *, or the apparently wild cursor position marker, or the 
strange async definition that was never used!  No way I could have just issued 
a call to _flush and done the same thing!)
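
The quoted behavior can be checked with a short sketch like the following, using only the standard library. The identifier async_def is just the example from above, not a real API.

```python
import unicodedata

# The six underscore lookalikes from the quoted message, as escapes.
LOOKALIKES = "\ufe33\ufe34\ufe4d\ufe4e\ufe4f\uff3f"

all_fold_to_underscore = all(
    unicodedata.normalize("NFKC", ch) == "_" for ch in LOOKALIKES
)

# Accepted inside an identifier, but rejected as its first character.
ok_in_body = "async\ufe33def".isidentifier()
ok_as_first = "\ufe33flush".isidentifier()

# Once compiled, the name is stored under its normalized ASCII spelling.
namespace = {}
exec("async\ufe33def = 'bound'", namespace)

print(all_fold_to_underscore, ok_in_body, ok_as_first)
print(namespace["async_def"])
```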

> Potential beneficial uses:
> I am considering taking my transformer code and experimenting with an
> orthogonal approach to syntax highlighting, using Unicode groups 
> instead of colors. Module names using characters from one group,
> builtins from another, program variables from another, maybe 
> distinguish local from global variables. Colorizing has always been an
> obvious syntax highlight feature, but is an accessibility issue for those
> with difficulty distinguishing colors.

I kind of like the idea, but ... if you're doing it on-the-fly in the editor, 
you could just use different fonts.  If you're actually saving those changes, 
it seems likely to lead to a lot of spurious diffs if anyone uses a different 
editor.

-jJ
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/NPTL43EVT2FF76LXIBBWVHDU6NXH3HF5/
Code of Conduct: http://python.org/psf/codeofconduct/
