I’ve not been following the thread, but Steve Holden forwarded me the email 
from Petr Viktorin so that I might share some of the info I found while 
recently diving into this topic.

 

As part of working on the next edition of “Python in a Nutshell” with Steve, 
Alex Martelli, and Anna Ravencroft, Alex suggested that I add a cautionary 
section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” 
(GREEK CAPITAL LETTER ALPHA) as an example problem pair. I wanted to look a 
little further at the use of characters in identifiers beyond the standard 
7-bit ASCII, and so I found some of these same issues dealing with Unicode NFKC 
normalization. The first discovery was the overlapping normalization of “ªº” 
with “ao”. This was quite a shock to me, since I assumed that the inclusion of 
Unicode for identifier characters would preserve the uniqueness of the 
different code points. Even ligatures can be used, and will overlap with their 
multi-character ASCII forms. So we have added a second note in the upcoming 
edition on the risks of using these “homonorms” (which is a word I just made up 
for the occasion).
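
A minimal sketch of the overlap described above, using only the standard unicodedata module. The helper name nfkc and the variable names are illustrative, not taken from the book or from any library:

```python
import unicodedata

def nfkc(text):
    """Return the NFKC-normalized form of text."""
    return unicodedata.normalize("NFKC", text)

# U+00AA FEMININE ORDINAL INDICATOR and U+00BA MASCULINE ORDINAL INDICATOR
ordinals = "\u00aa\u00ba"
print(nfkc(ordinals) == "ao")            # True: a "homonorm" of "ao"

# U+FB01 LATIN SMALL LIGATURE FI collapses to the two ASCII letters "fi"
print(nfkc("\ufb01") == "fi")            # True

# Homoglyphs are a different problem: Latin A and Greek Alpha look alike
# but remain distinct after normalization.
latin_a, greek_alpha = "A", "\u0391"
print(nfkc(latin_a) == nfkc(greek_alpha))  # False

# Python normalizes identifiers when compiling, so both spellings
# below refer to the same variable.
namespace = {}
exec(ordinals + " = 1", namespace)
print("ao" in namespace)                 # True
```

The last check is the surprising part: the name stored in the namespace is the normalized “ao”, not the characters that were typed.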

 

To explore the extreme case, I wrote a pyparsing transformer to convert 
identifiers in a body of Python source to mixed font, equivalent to the 
original source after NFKC normalization. Here are hello.py, and a snippet from 
unittest/util.py:

 

def 𝚑𝓮𝖑𝒍𝑜():
    try:
        𝔥e𝗅𝕝𝚘︴ = "Hello"
        𝕨𝔬r𝓵ᵈ﹎ = "World"
        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
    except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
        𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))

if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
    𝒉eℓˡ𝗈()

 

 

# snippet from unittest/util.py
_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12

def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
    ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
    if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
        𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
    return ₛ

 

 

You should be able to paste these into your local UTF-8-aware editor or IDE and 
execute them as-is.
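
For anyone who wants to experiment, here is a rough sketch of how such mixed-font names can be generated. It is not the pyparsing transformer mentioned above. It handles a single identifier only, and the names build_homonorm_table and scramble are hypothetical. It scans every code point for characters that NFKC-normalize to one ASCII letter and that Python accepts in an identifier, then picks among them at random:

```python
import random
import sys
import unicodedata

def build_homonorm_table():
    """Map each ASCII letter to non-ASCII characters that NFKC-normalize to it."""
    table = {}
    for cp in range(0x80, sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        norm = unicodedata.normalize("NFKC", ch)
        # Keep only characters that collapse to one ASCII letter and that
        # are valid on their own as an identifier (so usable in any position).
        if len(norm) == 1 and norm.isascii() and norm.isalpha() and ch.isidentifier():
            table.setdefault(norm, []).append(ch)
    return table

def scramble(name, table, rng):
    """Replace each letter of name with a random character that normalizes to it."""
    out = []
    for ch in name:
        candidates = table.get(ch, []) + [ch]  # the plain letter stays a candidate
        out.append(rng.choice(candidates))
    return "".join(out)

table = build_homonorm_table()
rng = random.Random(0)  # fixed seed so repeated runs give the same result
scrambled = scramble("hello", table, rng)
print(scrambled, unicodedata.normalize("NFKC", scrambled))
```

Digits, underscores, and any other non-letter characters pass through unchanged in this sketch. A full source transformer would also need to leave keywords, strings, and comments alone.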

 

(If this doesn’t come through, you can also see this as a GitHub gist, “Hello, 
World rendered in a variety of Unicode characters”: 
<https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466>. I have a 
second gist containing the transformer, but it is still a private gist atm.)

 

 

Some other discoveries:

“·” (U+00B7 MIDDLE DOT, code point 183, which is outside 7-bit ASCII) is a 
valid identifier body character, making “_···” a valid Python identifier. This 
could actually be another security attack point, in which “s·join(‘x’)” could 
be easily misread as “s.join(‘x’)”, but would actually be a call to a 
potentially malicious function named “s·join”.
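
A short check of the middle-dot case, sketched with the standard ast module. The variable names are illustrative. The lookalike expression parses as a call to a single name, while the genuine one parses as an attribute access on “s”:

```python
import ast

MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT

# The dot may appear in the body of an identifier, but not at the start.
name = "_" + MIDDLE_DOT * 3
print(name.isidentifier())                  # True
print((MIDDLE_DOT + "x").isidentifier())    # False

# Parse the lookalike and the genuine method call, then compare what
# the parser thinks is being called.
lookalike = ast.parse("s" + MIDDLE_DOT + "join('x')", mode="eval").body
genuine = ast.parse("s.join('x')", mode="eval").body

print(type(lookalike.func).__name__, repr(lookalike.func.id))
print(type(genuine.func).__name__, repr(genuine.func.attr))
```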

“_” seems to be a special case for normalization. Only the ASCII “_” character 
is valid as a leading identifier character; the Unicode characters that 
normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”) can only be used as 
identifier body characters. “︳” especially could be misread as “|” followed by 
a space, when it actually normalizes to “_”.
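
The underscore behavior can be checked with a few lines. The code points listed here are my reading of the characters quoted above (U+FE33, U+FE34, U+FE4D, U+FE4E, U+FE4F, and U+FF3F FULLWIDTH LOW LINE), so treat the list as an assumption:

```python
import unicodedata

# Assumed list of non-ASCII characters that NFKC-normalize to "_".
UNDERSCORE_HOMONORMS = "\ufe33\ufe34\ufe4d\ufe4e\ufe4f\uff3f"

results = []
for ch in UNDERSCORE_HOMONORMS:
    normalizes_to_underscore = unicodedata.normalize("NFKC", ch) == "_"
    valid_leading = ch.isidentifier()          # usable as the first character?
    valid_body = ("a" + ch).isidentifier()     # usable after the first character?
    results.append((ch, normalizes_to_underscore, valid_leading, valid_body))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"normalizes={normalizes_to_underscore} "
          f"leading={valid_leading} body={valid_body}")
```

Each line of output should report that the character normalizes to “_” and is accepted in the body of an identifier but not as its first character.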

 

 

Potential beneficial uses:

I am considering taking my transformer code and experimenting with an 
orthogonal approach to syntax highlighting, using Unicode groups instead of 
colors. Module names could use characters from one group, builtins from 
another, and program variables from a third, perhaps even distinguishing local 
from global variables. Colorizing has always been an obvious syntax 
highlighting feature, but it is an accessibility issue for those with 
difficulty distinguishing colors. Unlike the “ransom note” code above, code 
highlighted in this way might even be quite pleasing to the eye.
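
As a very rough illustration of the idea, and not the transformer itself, the sketch below uses the standard tokenize module to restyle names in ASCII source. The choice of alphabets is arbitrary: builtins go to Mathematical Bold and all other names to Mathematical Monospace. The function names restyle and highlight are hypothetical. Keywords are left untouched, as in the examples above:

```python
import builtins
import io
import keyword
import tokenize
import unicodedata

# Start of the A-Z and a-z runs in two Mathematical Alphanumeric alphabets.
BOLD = (0x1D400, 0x1D41A)
MONOSPACE = (0x1D670, 0x1D68A)

def restyle(name, alphabet):
    """Map the ASCII letters of name into the given alphabet, one for one."""
    upper_base, lower_base = alphabet
    out = []
    for ch in name:
        if "A" <= ch <= "Z":
            out.append(chr(upper_base + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(lower_base + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits and underscores are kept as-is
    return "".join(out)

def highlight(source):
    """Return source with builtins in bold and other names in monospace."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            alphabet = BOLD if hasattr(builtins, tok.string) else MONOSPACE
            # Each letter maps to exactly one character, so the recorded
            # token positions stay valid for untokenize().
            tok = tok._replace(string=restyle(tok.string, alphabet))
        tokens.append(tok)
    return tokenize.untokenize(tokens)

sample = "total = len([1, 2, 3])\n"
styled = highlight(sample)
print(styled)
```

Because every substituted character normalizes back to its ASCII letter under NFKC, the styled source should still run as ordinary Python.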

 

 

-- Paul McGuire

 

 

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/GBLXJ2ZTIMLBD2MJQ4VDNUKFFTPPIIMO/
Code of Conduct: http://python.org/psf/codeofconduct/
