On 11/1/2021 8:17 AM, Petr Viktorin wrote:
Nevertheless, I did do a bit of research about similar gotchas in
Python, and I'd like to publish a summary as an informational PEP,
pasted below.
Very helpful.
Bidirectional Text
------------------
Some scripts, such as Hebrew or Arabic, are written right-to-left.
[Suggested addition, subject to further revision.]
There are at least three levels of handling right-to-left characters:
none, local (contiguous sequences are properly reversed), and extended
(see below). The handling depends on the display software and may depend
on the quoting. Tk, and hence tkinter (and IDLE), text widgets do local
handling. Notepad++ on Windows does local handling of unquoted code but
extended handling of quoted text. Windows Notepad currently does extended
handling even without quotes.
In extended handling, phrases ...
Phrases in such scripts interact with nearby text in ways that can be
surprising to people who aren't familiar with these writing systems
and their computer representation.
The exact process is complicated, and explained in Unicode® Standard
Annex #9, "Unicode Bidirectional Algorithm".
Some surprising examples include:
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the
integer 23.
With local handling, one sees <hebrew-rtl> = 23. With extended handling,
one sees 23 = <hebrew-rtl>. (Notepad++ treats backticks as quotes.)
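Whatever the display does, the stored (logical) order of the characters is the same. A minimal sketch, using only the standard ``unicodedata`` module, that lists the bidirectional class of each character in the statement above. The helper name ``bidi_classes`` is made up for illustration, and the identifier is written with escapes so that the example itself is not reordered by the viewer:

```python
import unicodedata

# The statement from the example, in logical (stored) order:
# three Hebrew letters, then " = 23".
statement = "\u05e2\u05e8\u05da = 23"


def bidi_classes(text):
    """Return (code point, bidirectional class) pairs in logical order."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


for code_point, bidi_class in bidi_classes(statement):
    print(code_point, bidi_class)

# The statement still assigns to the Hebrew-named variable, however it is shown.
namespace = {}
exec(statement, namespace)
print(namespace["\u05e2\u05e8\u05da"])
```

The Hebrew letters report a strong right-to-left class, while the spaces and ``=`` are neutral and the digits are a weak numeric class. As far as I understand the algorithm, that is why a display doing extended handling is free to move `` = 23`` to the other side of the identifier.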
Source Encoding
---------------
The encoding of Python source files is given by a specific regex on
the first two lines of a file, as per `Encoding declarations`_.
This mechanism is very liberal in what it accepts, and thus easy to
obfuscate.
This can be misused in combination with Python-specific special-purpose
encodings (see `Text Encodings`_).
Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to
something?
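How liberal the declaration matching is can be checked with ``tokenize.detect_encoding`` from the standard library. This is only a sketch: the helper name ``declared_encoding`` and the sample source lines are invented for illustration, not taken from the draft.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding that tokenize detects for the given source bytes."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A conventional declaration on the second line, after a shebang.
conventional = b"#!/usr/bin/python\n# vim: set fileencoding=unicode_escape :\n"

# A comment that merely mentions "coding:" in passing is accepted too.
in_passing = b"# this text mentions coding: unicode_escape in passing\n"

# No declaration at all falls back to the default.
undeclared = b"x = 1\n"

for sample in (conventional, in_passing, undeclared):
    print(declared_encoding(sample))
```

The second sample is the worrying one: a reader skimming the comment would not necessarily take it for an encoding declaration.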
For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::
I don't see the connection between the text above and the example that
follows.
# For writing Japanese, you don't need an editor that supports
# UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]
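A much smaller illustration of the hidden-quote point might make the connection easier to follow. This sketch decodes one line of bytes by hand with the ``unicode_escape`` codec, which only approximates what happens when a file declares that encoding; the line itself is invented for illustration.

```python
# What an editor or highlighter sees: one string literal containing escapes.
raw_line = rb"value = 'harmless\u0027 + \u0027 appended'"

# What the code looks like after decoding: \u0027 has become a real quote,
# so the single apparent string is now two strings joined by "+".
decoded_line = raw_line.decode("unicode_escape")
print(decoded_line)

# Running the decoded text shows that the "+" was executed as code.
namespace = {}
exec(decoded_line, namespace)
print(repr(namespace["value"]))
```

Here the escaped quotes end one string and start another, so text that looked like part of a string literal is actually executed.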
--
Terry Jan Reedy
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/34JROXNUHEUDC4TOWUAM74KIGIRRHHG4/