[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

Serhiy Storchaka Wed, 03 Nov 2021 08:50:31 -0700

03.11.21 14:31, Petr Viktorin wrote:
> For example: should the parser emit a lightweight audit event if it
> finds a non-ASCII identifier? (See below for why ASCII is special.)
> Or for encoding declarations?


There are audit events for import and compile. You can also register
import hooks if you want fancier preprocessing than just
Unicode encoding. I do not think we need to add more specific audit
events; they were not designed for this.
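
For illustration, here is a minimal sketch of what listening to the
existing "compile" audit event could look like. The hook name
report_non_ascii and the flagged list are made up for this example, and
the policy (flag any non-ASCII source) is only an assumption, not a
recommendation. Note that audit hooks cannot be removed once added.

```python
import sys

# Filenames whose compiled source contained non-ASCII characters.
flagged = []

def report_non_ascii(event, args):
    # Hypothetical hook: only the "compile" event is inspected here.
    if event != "compile":
        return
    source, filename = args
    # The source may also be an AST object or None; only text and bytes
    # are checked.
    if isinstance(source, (str, bytes)) and not source.isascii():
        flagged.append(filename)

sys.addaudithook(report_non_ascii)

# "\u0454" is a Cyrillic letter, so this source is not pure ASCII.
compile("\u0454 = 1", "<demo>", "exec")
print(flagged)
```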

And I think it is too late to detect suspicious code at the time of its
execution. It should be detected before adding that code to the code
base (review tools, pre-commit hooks).
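
A review tool or pre-commit hook of that kind could be sketched roughly
as follows, using the standard tokenize module. The function name
find_non_ascii_identifiers is hypothetical, and reporting every
non-ASCII identifier is just one possible policy.

```python
import io
import tokenize

def find_non_ascii_identifiers(source):
    """Return (line, column, name) for every NAME token that is not pure ASCII."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.start[1], tok.string))
    return found

# The second identifier starts with Cyrillic "\u0440", which resembles Latin "p".
sample = "x = 1\n\u0440ima = 2\n"
for line, column, name in find_non_ascii_identifiers(sample):
    print(f"line {line}, column {column}: non-ASCII identifier {name!r}")
```

A real hook would run this over the files being committed and exit
with a non-zero status when the project's policy is violated.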

> I don't think this would actually ban Cyrillic/Greek.
> (My suggestion is not vanilla confusables detection; it might require
> careful reading: "should there be a [linter] warning when an identifier
> looks like ASCII but isn't?")

Yes, but it should be optional and configurable, and it should not be
part of the Python compiler. This is not our business as Python core
developers.

> I am not a native speaker, but I did try a bit to find an actual
> ASCII-like word in a language that uses Cyrillic. I didn't succeed; I
> think they might be very rare.

With a simple script I have found 62 words common to English and
Ukrainian: гасу/racy, горе/rope, рима/puma, міх/mix, etc. But there are
many more English and Ukrainian words that contain only letters which
can be confused with letters from the other script. And identifiers can
contain abbreviations and shortened words, not all of which can be
found in dictionaries.
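
A search like this can be sketched in a few lines. This is not the
script that was actually used; the LOOKALIKES table is a hand-picked
assumption inferred from the four example pairs above, not the Unicode
confusables data, and the word lists are placeholders.

```python
# Assumed mapping of Cyrillic letters to the Latin letters they can be
# mistaken for (lowercase only, inferred from the examples above).
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0441": "c",  # Cyrillic es
    "\u0435": "e",  # Cyrillic ie
    "\u0456": "i",  # Ukrainian i
    "\u043c": "m",  # Cyrillic em
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0433": "r",  # Cyrillic ghe
    "\u0438": "u",  # Cyrillic i (resembles "u" in italics)
    "\u0445": "x",  # Cyrillic ha
    "\u0443": "y",  # Cyrillic u
})

def ascii_lookalike(word):
    """Return the Latin lookalike of word, or None if some letter has none."""
    translated = word.translate(LOOKALIKES)
    return translated if translated.isascii() else None

def shared_words(ukrainian_words, english_words):
    """Pair each Ukrainian word with the English word it can be mistaken for."""
    english = set(english_words)
    pairs = []
    for word in ukrainian_words:
        candidate = ascii_lookalike(word)
        if candidate is not None and candidate in english:
            pairs.append((word, candidate))
    return pairs

# Placeholder word lists; a real run would load two dictionaries.
print(shared_words(["\u043c\u0456\u0445"], ["mix", "word"]))
```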

> Even if there was such a word -- or a one-letter abbreviation used as a
> variable name -- it would be confusing to use. Removing the possibility
> of confusion could *help* Cyrillic users. (I can't speak for them; this
> is just a brainstorming idea.)

I have never used non-Latin identifiers in Python, but I guess that where
they are used (in schools?) there is a mix of English and non-English
identifiers, and identifiers consisting of parts of English and
non-English words without even an underscore between them. I know
because in other languages they just use inconsistent transliteration.
Emitting any warning by default would discriminate against non-English
users. It would have been better not to add support for non-ASCII
identifiers in the first place.

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/XHHXRWGKTDTZIYGS6AB3DKEVFH5D6BHV/
Code of Conduct: http://python.org/psf/codeofconduct/
