On 11/1/2021 8:17 AM, Petr Viktorin wrote:

Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below.

Very helpful.

Bidirectional Text
------------------

Some scripts, such as Hebrew or Arabic, are written right-to-left.

[Suggested addition, subject to further revision.]

There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handling. Windows Notepad++ does local handling of unquoted code but extended handling of quoted text. Windows Notepad currently does extended handling even without quotes.

In extended handling, phrases ...

Phrases in such scripts interact with nearby text in ways that can be
surprising to people who aren't familiar with these writing systems and their
computer representation.

The exact process is complicated, and explained in Unicode® Standard Annex #9,
"Unicode Bidirectional Algorithm".

Some surprising examples include:

* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.

In local handling, one sees <hebrew-rtl> = 23.  In extended handling,
one sees 23 = <hebrew-rtl>.  (Notepad++ sees backticks as quotes.)
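
Whatever the display does, the stored (logical) order is name first, then ``=``, then the number. A minimal sketch to check this, writing the Hebrew name with escapes so that no editor can reorder it; the names ``name``, ``source``, ``namespace`` and ``classes`` are only illustrative:

```python
import unicodedata

# The identifier from the example (ayin, resh, final kaf), spelled with
# escapes so the logical order is unambiguous in any editor.
name = "\u05e2\u05e8\u05da"
source = name + " = 23"

# Python reads the logical order: the name is bound to the integer 23.
namespace = {}
exec(source, namespace)
print(namespace[name])

# The bidirectional class of each character is what display software uses
# to decide the visual order: R (right-to-left), WS (whitespace),
# ON (other neutral), EN (European number).
classes = [unicodedata.bidirectional(ch) for ch in source]
print(classes)
```

The run of ``R`` characters is what local handling reverses; extended handling also has to place the neutral and number characters around it.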


Source Encoding
---------------

The encoding of Python source files is given by a specific regex on the first
two lines of a file, as per `Encoding declarations`_.
This mechanism is very liberal in what it accepts, and thus easy to obfuscate.

This can be misused in combination with Python-specific special-purpose
encodings (see `Text Encodings`_).
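
To see how liberal the mechanism is, one can feed ordinary-looking comment lines to ``tokenize.detect_encoding``. This is only a sketch: the two comment lines are taken from the example quoted further down, and ``first_two_lines`` is an illustrative name.

```python
import io
import tokenize

# Two comment lines that read like prose; the second one happens to
# contain the text "encoding: unicode_escape".
first_two_lines = (
    b"# For writing Japanese, you don't need an editor that supports\n"
    b"# UTF-8 source encoding: unicode_escape sequences work just as well.\n"
)

# detect_encoding() looks at up to two lines, the same way the
# interpreter does when it opens a source file.
encoding, consumed = tokenize.detect_encoding(io.BytesIO(first_two_lines).readline)
print(encoding)
```

Nothing in the second line looks like a declaration to a casual reader, yet it selects the codec for the whole file.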


Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to something?


For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::

I don't see the connection between the text above and the example that follows.

    # For writing Japanese, you don't need an editor that supports
    # UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
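
A smaller case may make the connection to the preceding paragraph clearer. The sketch below is not part of the draft; ``raw`` is a made-up line of source bytes, decoded once as UTF-8 and once as ``unicode_escape``, to show a quote hidden as ``\x22`` ending the string early:

```python
# One line of source bytes; \x22 is the escape for a double quote.
raw = rb'"a\x22 + str(6 * 7) + \x22b"'

# Read as UTF-8, the escapes stay inside the literal, so the whole line
# is a single string. This is what most highlighters and linters assume.
as_utf8 = eval(raw.decode("utf-8"))

# Read as unicode_escape, the escapes become real quotes before the
# tokenizer runs, so the middle part is executed as code.
as_escape = eval(raw.decode("unicode_escape"))

print(repr(as_utf8))
print(repr(as_escape))
```

The same bytes give one plain string under UTF-8 and a concatenation with a function call under ``unicode_escape``.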


--
Terry Jan Reedy
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/34JROXNUHEUDC4TOWUAM74KIGIRRHHG4/