[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

Terry Reedy Mon, 01 Nov 2021 18:12:03 -0700

On 11/1/2021 8:17 AM, Petr Viktorin wrote:

Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.


Very helpful.

Bidirectional Text
------------------

Some scripts, such as Hebrew or Arabic, are written right-to-left.


[Suggested addition, subject to further revision.]

There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handling. Windows Notepad++ does local handling of unquoted code but extended handling of quoted text. Windows Notepad currently does extended handling even without quotes.


In extended handling, phrases ...

Phrases in such scripts interact with nearby text in ways that can be
surprising to people who aren't familiar with these writing systems and their
computer representation.
The exact process is complicated, and explained in Unicode® Standard Annex #9,
"Unicode Bidirectional Algorithm".

Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.


In local handling, one sees <hebrew-rtl> = 23`.  In extended handling,
one sees 23 = <hebrew-rtl>.  (Notepad++ sees backticks as quotes.)
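
As a rough illustration of why the display can differ from what Python parses: the interpreter only sees the logical (storage) order of the characters, while an editor may reorder them based on each character's bidirectional class. The sketch below uses the standard ``unicodedata`` module to list those classes; the helper names ``bidi_classes`` and ``has_rtl`` are made up for this example and are not part of the PEP draft.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidirectional class) pairs in logical order."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def has_rtl(text):
    """True if any character has a strong right-to-left class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


# The statement from the example above, spelled with escapes so that this
# snippet itself displays the same way in every editor.
line = "\u05e2\u05e8\u05da = 23"

for ch, cls in bidi_classes(line):
    print(f"U+{ord(ch):04X} {cls}")

print(has_rtl(line))          # Hebrew letters have class "R"
print(has_rtl("value = 23"))  # pure ASCII has no strong RTL characters
```

In logical order the Hebrew name comes first, then ``=``, then the digits, which is the order Python uses regardless of how a given editor chooses to draw the line.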

Source Encoding
---------------
The encoding of Python source files is given by a specific regex on the first
two lines of a file, as per `Encoding declarations`_.
This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
This can be misused in combination with Python-specific special-purpose
encodings (see `Text Encodings`_).
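
One way to see how liberal the declaration matching is would be to ask the standard library which encoding it detects. The sketch below relies on ``tokenize.detect_encoding``, which reads at most the first two lines of a byte stream; the wrapper name ``declared_encoding`` and the sample inputs are assumptions made for illustration only.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding the tokenizer detects for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# No declaration at all: the default applies.
plain = b"x = 1\n"

# The declaration is tucked into an editor-style comment on the second line;
# only the "coding=" part matters to the matcher.
odd = b"#!/usr/bin/env python\n# vim: set fileencoding=unicode_escape :\n"

print(declared_encoding(plain))
print(declared_encoding(odd))
```

A reviewer skimming the second sample could easily take the comment for editor configuration, yet it changes how the rest of the file is decoded.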

Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to something?

For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::

I don't see the connection between the text above and the example that follows.

    # For writing Japanese, you don't need an editor that supports
    # UTF-8 source encoding: unicode_escape sequences work just as well.

[etc]
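
A smaller sketch of the same mechanism, not taken from the PEP draft, may make the connection clearer. The same bytes are decoded twice: once as UTF-8, which is what most tools assume, and once as ``unicode_escape``, which is what Python would use if the file declared that encoding. The byte string ``raw`` is an invented example.

```python
import ast

# Source bytes as they sit on disk: a backslash followed by "u0027",
# not an actual quote character.
raw = rb"'abc\u0027 + \u0027def'"

# What a highlighter or linter that assumes UTF-8 would parse.
as_utf8 = raw.decode("utf-8")
# What Python parses when the file declares "coding: unicode_escape".
as_escape = raw.decode("unicode_escape")

utf8_tree = ast.parse(as_utf8, mode="eval").body
escape_tree = ast.parse(as_escape, mode="eval").body

print(as_escape)
print(type(utf8_tree).__name__)    # a single string constant
print(type(escape_tree).__name__)  # a binary operation on two strings
```

Under UTF-8 the escapes stay inside one string literal; under ``unicode_escape`` they become real quotes before parsing, so the literal ends early and the text between the quotes is executed as code.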


--
Terry Jan Reedy