Indeed, section 5 of the normative annex UAX #31
(https://www.unicode.org/reports/tr31/tr31-35.html) says: "if the
programming language has case-sensitive identifiers, then Normalization
Form C is appropriate" (vs NFKC for a language with case-insensitive
identifiers). Python identifiers are case-sensitive, so to follow the
standard we should have used NFC rather than NFKC. I'm not sure whether
it's too late to fix this "oops" in future Python versions.
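
To make the difference concrete, here is a minimal sketch using the
standard unicodedata module. The escaped string is my own spelling of the
styled "print" from Chris's session quoted below, and the variable names
are only for illustration:

```python
import unicodedata

# Styled spelling of "print" (modifier letter p plus mathematical
# alphanumerics), written with escapes so the file encoding does not matter.
fancy = "\u1d56\U0001d597\U0001d422\U0001d62f\U0001d4fd"

nfc = unicodedata.normalize("NFC", fancy)    # canonical composition only
nfkc = unicodedata.normalize("NFKC", fancy)  # also folds compatibility forms

print(nfc == fancy)     # does NFC leave the styled letters untouched?
print(nfkc == "print")  # does NFKC collapse them to plain ASCII?
```

If identifiers were normalized with NFC instead, the styled spelling
would stay a distinct name rather than becoming an alias for print.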

Alex

On Sun, Nov 14, 2021 at 9:17 AM Christopher Barker <python...@gmail.com>
wrote:

> On Sat, Nov 13, 2021 at 2:03 PM <pt...@austin.rr.com> wrote:
>
>> def 𝚑𝓮𝖑𝒍𝑜():
>>
>>     try:
>>
>>         𝔥e𝗅𝕝𝚘︴ = "Hello"
>>
>>         𝕨𝔬r𝓵ᵈ﹎ = "World"
>>
>>         ᵖ𝖗𝐢𝘯𝓽(f"{𝗵ｅ𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
>>
>>     except 𝓣𝕪ᵖｅ𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
>>
>>         𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªｔ(ᵉ𝐱𝓬))
>>
>
> Wow. Just Wow.
>
> So why does Python apply NFKC normalization to variable names?? I can't
> for the life of me figure out why that would be helpful at all.
>
> The string methods, sure, but names?
>
> And, in fact, the normalization is not used for string comparisons or
> hashes as far as I can tell.
>
> In [36]: weird
> Out[36]: 'ᵖ𝖗𝐢𝘯𝓽'
>
> In [37]: normal
> Out[37]: 'print'
>
> In [38]: eval(weird + "('yup, that worked')")
> yup, that worked
>
> In [39]: weird == normal
> Out[39]: False
>
> In [40]: weird[0] in normal
> Out[40]: False
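
A small sketch of what I believe is happening in that session: the
parser normalizes identifiers to NFKC (as PEP 3131 specifies), while str
objects are never normalized implicitly. The names fancy_name and
namespace are made up for this example:

```python
# MATHEMATICAL SCRIPT SMALL X, written as an escape to avoid encoding issues.
fancy_name = "\U0001d4cd"

namespace = {}
# The compiler sees a styled identifier and normalizes it while parsing.
exec(f"{fancy_name} = 42", namespace)

stored_under_plain = "x" in namespace         # key after normalization
stored_under_fancy = fancy_name in namespace  # the raw styled spelling
string_equal = fancy_name == "x"              # plain string comparison

print(stored_under_plain, stored_under_fancy, string_equal)
```

So the name is stored under its normalized spelling, but comparing the
two strings directly still reports them as different.
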
>
> This seems very odd (and dangerous) to me.
>
> Is there a good reason? and is it too late to change it?
>
> -CHB
>
>>
>>
>> if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
>>
>>     𝒉eℓˡ𝗈()
>>
>>
>>
>>
>>
>> # snippet from unittest/util.py
>>
>> _𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12
>>
>> def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢ﬁ𝖝𝕝𝚎𝑛, ｓᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
>>
>>     ˢ𝗸ｉ𝗽 = 𝐥ｅ𝘯(𝖘) - ｐr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ﬀｉ𝘅𝗹𝙚ₙ
>>
>>     if sｋi𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺Ｌ𝕯𝙀𝘙﹏L𝔈𝒩:
>>
>>         𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚ｘℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓
>> (𝚜) - 𝙨𝚞𝒇ﬁx𝙡ᵉ𝘯:])
>>
>>     return ₛ
>>
>>
>>
>>
>>
>> You should be able to paste these into your local UTF-8-aware editor or
>> IDE and execute them as-is.
>>
>>
>>
>> (If this doesn’t come through, you can also see this as a GitHub gist:
>> “Hello, World rendered in a variety of Unicode characters”
>> <https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466>. I have
>> a second gist containing the transformer, but it is still a private gist
>> atm.)
>>
>>
>>
>>
>>
>> Some other discoveries:
>>
>> “·” (U+00B7 MIDDLE DOT, code point 183, which is Latin-1 rather than
>> ASCII) is a valid identifier body character, making “_···” a valid
>> Python identifier. This could actually be another security attack
>> point, in which “s·join(‘x’)” could be easily misread as
>> “s.join(‘x’)”, but would actually be a call to a potentially malicious
>> function named “s·join”.
>>
>> “_” seems to be a special case for normalization. Only the ASCII “_”
>> character is valid as a leading identifier character; the Unicode
>> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”) can
>> only be used as identifier body characters. “︳” especially could be
>> misread as “|” followed by a space, when it actually normalizes to “_”.
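
These two observations can be checked with str.isidentifier() and
unicodedata.normalize(). This is only a sketch: the escapes are my
reading of the six underscore look-alikes Paul lists, and the variable
names are invented for the example:

```python
import unicodedata

# The six characters listed above as normalizing to "_", as escapes.
underscore_likes = "\ufe33\ufe34\ufe4d\ufe4e\ufe4f\uff3f"

all_fold_to_underscore = all(
    unicodedata.normalize("NFKC", ch) == "_" for ch in underscore_likes
)
# Each character alone, i.e. in the leading position of an identifier.
valid_as_first_char = [ch.isidentifier() for ch in underscore_likes]
# Each character after an ordinary letter, i.e. in the identifier body.
valid_in_body = [("a" + ch).isidentifier() for ch in underscore_likes]

# U+00B7 MIDDLE DOT between letters: one identifier, not attribute access.
middle_dot_name = "s\u00b7join"
is_single_identifier = middle_dot_name.isidentifier()

print(all_fold_to_underscore, valid_as_first_char, valid_in_body)
print(is_single_identifier)
```
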
>>
>>
>>
>>
>>
>> Potential beneficial uses:
>>
>> I am considering taking my transformer code and experimenting with an
>> orthogonal approach to syntax highlighting, using Unicode groups instead of
>> colors. Module names using characters from one group, builtins from
>> another, program variables from another, maybe distinguish local from
>> global variables. Colorizing has always been an obvious syntax highlight
>> feature, but is an accessibility issue for those with difficulty
>> distinguishing colors. Unlike the “ransom note” code above, code
>> highlighted in this way might even be quite pleasing to the eye.
>>
>>
>>
>>
>>
>> -- Paul McGuire
>>
>>
>>
>>
>> _______________________________________________
>> Python-Dev mailing list -- python-dev@python.org
>> To unsubscribe send an email to python-dev-le...@python.org
>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-dev@python.org/message/GBLXJ2ZTIMLBD2MJQ4VDNUKFFTPPIIMO/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>
>
> --
> Christopher Barker, PhD (Chris)
>
> Python Language Consulting
>   - Teaching
>   - Scientific Software Development
>   - Desktop GUI and Web Development
>   - wxPython, numpy, scipy, Cython
>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/U3DJOQKMREWY35SHCRSD6V6FQA2T3SW7/
Code of Conduct: http://python.org/psf/codeofconduct/