[Python-Dev] Re: [Python-checkins] bpo-41944: No longer call eval() on content received via HTTP in the UnicodeNames tests (GH-22575)

Walter Dörwald Wed, 07 Oct 2020 04:05:17 -0700

On 7 Oct 2020, at 1:27, Victor Stinner wrote:

Hi Walter,
https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3
On Tue, 6 Oct 2020 at 17:02, Walter Dörwald <wal...@livinglogic.de> wrote:
It would be even simpler to use unicodedata.lookup(), which returns the Unicode character when passed the name of the character.
That was my first idea as well when I reviewed the change, but the
function contains this comment:

    def checkletter(self, name, code):
        # Helper that put all \N escapes inside eval'd raw strings,
        # to make sure this script runs even if the compiler
        # chokes on \N escapes
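
A minimal sketch of the idea that comment describes: the \N escape is kept inside a raw string and only turned into a string literal at run time, so the test module itself still compiles if the name is unknown. The helper name char_from_escape is hypothetical, and the use of ast.literal_eval() here is an assumption made for illustration, not a quote from the test suite.

```python
import ast


def char_from_escape(name):
    """Hypothetical helper: evaluate "\\N{name}" at run time, not at compile time."""
    # The raw string keeps the backslash literal, so this module compiles
    # even when `name` is not a known Unicode character name.
    source = r'"\N{%s}"' % name
    # ast.literal_eval() only accepts literals, so arbitrary expressions
    # in `name` are not executed. Unknown names raise SyntaxError.
    return ast.literal_eval(source)


print(char_from_escape("DIGIT ZERO"))
print(char_from_escape("EURO SIGN"))
```

An unknown name surfaces as a SyntaxError from the call, which a test can catch, instead of breaking the import of the whole module.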

test_named_sequences_full() checks that unicodedata.lookup() works,


OK, that change would then have checked unicodedata.lookup() twice.

However, I'm puzzled by the fact that the "\N{}" escape sequence is supposed to raise a SyntaxError. And indeed it does in some cases:


Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.lookup("DIGIT ZERO")
'0'
>>> "\N{DIGIT ZERO}"
'0'
>>> "\N{EURO SIGN}"
'€'
>>> unicodedata.lookup("EURO SIGN")
'€'
>>> unicodedata.lookup("KEYCAP NUMBER SIGN")
'#️⃣'
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
>>> unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
'Ā̀'
>>> "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-47: unknown Unicode character name

It seems that unicodedata.lookup() honors "Code point sequences", but \N{} does not.
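
The same comparison can be scripted. This is a rough sketch: via_escape is a hypothetical helper, and the result for the two named sequences depends on the Python version in use, so the script only reports what it observes.

```python
import ast
import unicodedata


def via_escape(name):
    """Hypothetical helper: return the value of "\\N{name}", or None if rejected."""
    try:
        return ast.literal_eval(r'"\N{%s}"' % name)
    except SyntaxError:
        # The unicodeescape codec did not recognise the name.
        return None


names = (
    "DIGIT ZERO",
    "KEYCAP NUMBER SIGN",
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE",
)
for name in names:
    looked_up = unicodedata.lookup(name)
    code_points = " ".join("U+%04X" % ord(ch) for ch in looked_up)
    escape_ok = via_escape(name) is not None
    print("%-45s lookup: %-22s \\N{} accepted: %s" % (name, code_points, escape_ok))
```

Printing the code points makes it visible that the two failing names resolve to more than one code point, i.e. they are named sequences and not single characters.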

Indeed, https://docs.python.org/3/library/unicodedata.html#unicodedata.lookup

mentions that fact:

Changed in version 3.3: Support for name aliases and named sequences has been added.


https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

doesn't mention anything. It simply states

Escape Sequence: \N{name}
Meaning:         Character named name in the Unicode database

with the footnote "Changed in version 3.3: Support for name aliases has been added.".


Which leads to the question:

        Should \N{} be updated to support "Code point sequences"?

Furthermore, it states: "Unlike Standard C, all unrecognized escape sequences are left in the string unchanged", which could be interpreted as meaning that "\N{BAD}" results in "\\N{BAD}".
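
To make the difference concrete, here is a small sketch that evaluates both kinds of literal at run time. evaluate_literal is a hypothetical helper; "\q" stands in for an arbitrary unrecognized escape, and newer Python versions emit a warning for it (silenced here) and may eventually reject it.

```python
import ast
import warnings


def evaluate_literal(source):
    """Hypothetical helper: return the literal's value, or the SyntaxError raised."""
    with warnings.catch_warnings():
        # Unrecognized escapes such as "\q" trigger a compile-time warning
        # on newer Pythons; ignore it so only the outcome is shown.
        warnings.simplefilter("ignore")
        try:
            return ast.literal_eval(source)
        except SyntaxError as exc:
            return exc


# An unrecognized escape: the backslash is kept in the resulting string.
print(repr(evaluate_literal(r'"\q"')))

# A recognized escape with an unknown name: rejected outright.
print(repr(evaluate_literal(r'"\N{BAD}"')))
```

So \N is a recognized escape whose argument fails to decode, which is a different case from an escape the lexer does not know at all.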

but that checkletter() raises a SyntaxError. Look at the code ;-)


That would have helped. ;)

Victor


Servus,
   Walter

