[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

Marc-Andre Lemburg Mon, 15 Nov 2021 04:13:11 -0800

On 15.11.2021 12:36, Steven D'Aprano wrote:
> On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:
> 
>> I am, however, surprised and disappointed by the NFKC normalization.
>>
>> For example, in writing math we often use different scripts to mean 
>> different things (e.g. TeX's Blackboard Bold). So if I were to use 
>> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want 
>> them to get normalized.
> 
> Hmmm... would you really want these to all be different identifiers?
> 
>     𝕭 𝓑 𝑩 𝐁 B
> 
> You're assuming the reader of the code has the right typeface to view 
> them (rather than as mere boxes), and that their eyesight is good enough 
> to distinguish the variations even if their editor applies bold or 
> italic as part of syntax highlighting. That's very bold of you :-)
> 
> In any case, the question of NFKC versus NFC was certainly considered, 
> but unfortunately PEP 3131 doesn't document why NFKC was chosen.
> 
> https://www.python.org/dev/peps/pep-3131/
> 
> Before we change the normalisation rules, it would probably be a good 
> idea to trawl through the archives of the mailing list and work out why 
> NFKC was chosen in the first place, or contact Martin von Löwis and see 
> if he remembers.

This was raised in the discussion, but never conclusively answered:

https://mail.python.org/pipermail/python-3000/2007-May/007995.html

NFKC is the standard normalization form when you want to remove any
typography-related variants/hints from the text before comparing
strings. See http://www.unicode.org/reports/tr15/

I guess that's why Martin chose this form, since the point
was to maintain readability, even if different variants of a
character are used in the source code. A "B" in the source code
should be interpreted as an ASCII B, even when written
as 𝕭 𝓑 𝑩 or 𝐁.
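
To make the effect concrete, here is a minimal sketch using the stdlib
unicodedata module. It assumes the four glyphs above are the Mathematical
Alphanumeric Symbols named in the NAMES list (my reading of the glyphs, not
something stated in the thread), and it compares NFKC with NFC for each:

```python
import unicodedata

# Assumed identities of the four "B" variants shown above.
NAMES = [
    "MATHEMATICAL BOLD FRAKTUR CAPITAL B",
    "MATHEMATICAL BOLD SCRIPT CAPITAL B",
    "MATHEMATICAL BOLD ITALIC CAPITAL B",
    "MATHEMATICAL BOLD CAPITAL B",
]
variants = [unicodedata.lookup(name) for name in NAMES]

# NFKC applies compatibility decompositions, which fold font variants.
nfkc = [unicodedata.normalize("NFKC", ch) for ch in variants]
# NFC only applies canonical mappings, so font variants survive.
nfc = [unicodedata.normalize("NFC", ch) for ch in variants]

for name, ch, k, c in zip(NAMES, variants, nfkc, nfc):
    print(f"U+{ord(ch):05X} {name}: NFKC={k!r} NFC unchanged={c == ch}")
```

Under NFKC all four collapse to the ASCII letter, while NFC leaves them
distinct, which is the difference being debated here.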

This simplifies writing code and does away with many of the
security issues you could otherwise run into (where e.g. the
absence of an identifier causes the application flow to
be different).

>> Then there's the question of when this normalization happens (and when it
>> doesn't).

It happens in the parser when reading a non-ASCII identifier
(see Parser/pegen.c), so it only applies to source code, not to
attributes you add dynamically to e.g. class or module namespaces.
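
A small sketch of that difference, as I understand the behaviour described
above. The names bold_b, module_ns and obj are only illustrative, and
exec()/eval() stand in for "source code" because they go through the parser:

```python
import types
import unicodedata

bold_b = unicodedata.lookup("MATHEMATICAL BOLD CAPITAL B")

# Source code path: the parser normalizes the identifier to NFKC,
# so the name that ends up in the namespace is the ASCII "B".
module_ns = {}
exec(f"{bold_b} = 'from source'", module_ns)
source_has_plain = "B" in module_ns
source_has_bold = bold_b in module_ns

# Dynamic path: setattr() takes the string as-is, no normalization.
obj = types.SimpleNamespace()
setattr(obj, bold_b, "dynamic")
dynamic_has_plain = hasattr(obj, "B")
dynamic_has_bold = hasattr(obj, bold_b)

# Attribute access written in source is normalized too, so it looks
# for "B" and does not find the attribute stored under the bold name.
try:
    eval(f"obj.{bold_b}", {"obj": obj})
    source_lookup_found = True
except AttributeError:
    source_lookup_found = False

print(source_has_plain, source_has_bold)
print(dynamic_has_plain, dynamic_has_bold, source_lookup_found)
```

So a name added via setattr() or a namespace dict under a non-normalized
spelling can only be reached again through getattr() or the dict itself,
not through normal attribute syntax.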

--
Marc-Andre Lemburg
eGenix.com


Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/SNN2WZ3MOH5IACSZVHGS6DKTNMKO5JBV/
