On 7 Oct 2020, at 1:27, Victor Stinner wrote:

Hi Walter,

https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3

On Tue, 6 Oct 2020 at 17:02, Walter Dörwald <wal...@livinglogic.de> wrote:
It would be even simpler to use unicodedata.lookup(), which returns the Unicode character when passed the name of the character.

That was my first idea as well when I reviewed the change, but the
function contains this comment:

    def checkletter(self, name, code):
        # Helper that put all \N escapes inside eval'd raw strings,
        # to make sure this script runs even if the compiler
        # chokes on \N escapes

test_named_sequences_full() checks that unicodedata.lookup() works,

OK, that change would then have checked unicodedata.lookup() twice.
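The technique described in that comment can be sketched roughly as follows. This is only an illustration, not the actual test code: the function name eval_named_escape is made up here, and the sketch assumes nothing beyond the built-in eval().

```python
def eval_named_escape(name):
    """Evaluate a string literal containing a named escape at run time."""
    # The raw string keeps the backslash literal, so this module compiles
    # even if the compiler cannot handle the escape. The escape is only
    # interpreted when eval() compiles the generated literal.
    source = r'"\N{%s}"' % name
    return eval(source)


print(eval_named_escape("DIGIT ZERO"))
```

An unknown name makes eval() raise SyntaxError at run time instead of breaking the import of the module.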

However I'm puzzled by the fact that the "\N{}" escape sequence is supposed to raise a SyntaxError. And indeed it does in some cases:

Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.lookup("DIGIT ZERO")
'0'
>>> "\N{DIGIT ZERO}"
'0'
>>> "\N{EURO SIGN}"
'€'
>>> unicodedata.lookup("EURO SIGN")
'€'
>>> unicodedata.lookup("KEYCAP NUMBER SIGN")
'#️⃣'
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
>>> unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
'Ā̀'
>>> "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-47: unknown Unicode character name

It seems that unicodedata.lookup() honors "Code point sequences", but \N{} does not.
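The difference can be reproduced in a script with a small sketch like the one below. The helper name try_escape is hypothetical, and the behaviour shown is what the session above demonstrates on 3.8; other versions may differ.

```python
import unicodedata


def try_escape(name):
    # Build the literal at run time so that a failure can be caught here
    # instead of stopping compilation of the whole script.
    try:
        return eval(r'"\N{%s}"' % name)
    except SyntaxError:
        return None


names = (
    "EURO SIGN",
    "KEYCAP NUMBER SIGN",
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE",
)
for name in names:
    via_lookup = unicodedata.lookup(name)
    via_escape = try_escape(name)
    # Named sequences come back from lookup() as more than one code point.
    print(name, len(via_lookup), via_escape is not None)
```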

Indeed https://docs.python.org/3/library/unicodedata.html#unicodedata.lookup
mentions that fact:

Changed in version 3.3: Support for name aliases and named sequences has been added.

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

doesn't mention anything about them. It simply states:

Escape Sequence: \N{name}
Meaning:         Character named name in the Unicode database

with the footnote "Changed in version 3.3: Support for name aliases has been added.".
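For comparison, name aliases do seem to work in both places. A quick check, assuming the alias "LATIN CAPITAL LETTER GHA" for U+01A2 (picked here only as an example alias, it is not from the docs quoted above):

```python
import unicodedata

alias = "LATIN CAPITAL LETTER GHA"  # example alias for U+01A2, chosen for illustration

from_lookup = unicodedata.lookup(alias)
# eval() of a raw string so the escape is resolved at run time.
from_escape = eval(r'"\N{%s}"' % alias)

print(from_lookup == from_escape, hex(ord(from_lookup)))
```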

Which leads to the question:

        Should \N{} be updated to support "Code point sequences"?

Furthermore it states: "Unlike Standard C, all unrecognized escape sequences are left in the string unchanged", which could be interpreted as meaning that "\N{BAD}" results in "\\N{BAD}".
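The two behaviours can be put side by side with a sketch like this. The helper name eval_literal is hypothetical, and "\q" is just an arbitrary unrecognized escape; newer Python versions emit a warning for it, which is silenced here.

```python
import warnings


def eval_literal(source):
    # Return the evaluated literal, or the SyntaxError if it does not compile.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # invalid escapes warn on newer versions
        try:
            return eval(source)
        except SyntaxError as exc:
            return exc


kept = eval_literal(r'"\q"')       # unrecognized escape: backslash is kept
bad = eval_literal(r'"\N{BAD}"')   # unknown character name: SyntaxError

print(repr(kept), type(bad).__name__)
```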

but that checkletter() raises a SyntaxError. Look at the code ;-)

That would have helped. ;)

Victor

Servus,
   Walter