On 7 Oct 2020, at 1:27, Victor Stinner wrote:
Hi Walter,
https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3
On Tue, 6 Oct 2020 at 17:02, Walter Dörwald <wal...@livinglogic.de>
wrote:
It would be even simpler to use unicodedata.lookup(), which returns
the Unicode character when passed the name of the character.
That was my first idea as well when I reviewed the change, but the
function contains this comment:
def checkletter(self, name, code):
    # Helper that put all \N escapes inside eval'd raw strings,
    # to make sure this script runs even if the compiler
    # chokes on \N escapes
test_named_sequences_full() checks that unicodedata.lookup() works,
OK, that change would then have checked unicodedata.lookup() twice.
However I'm puzzled by the fact that the "\N{}" escape sequence is
supposed to raise a SyntaxError. And indeed it does in some cases:
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.lookup("DIGIT ZERO")
'0'
>>> "\N{DIGIT ZERO}"
'0'
>>> "\N{EURO SIGN}"
'€'
>>> unicodedata.lookup("EURO SIGN")
'€'
>>> unicodedata.lookup("KEYCAP NUMBER SIGN")
'#️⃣'
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
>>> unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
'Ā̀'
>>> "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-47: unknown Unicode character name
It seems that unicodedata.lookup() honors "Code point sequences", but
\N{} does not.
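The difference can also be reproduced in a script rather than at the prompt. This is only a minimal sketch: literal_escape() is a hypothetical helper name, and it evaluates the literal at run time with eval() so that the file itself still compiles. It assumes the behaviour shown in the session above (Python 3.8).

```python
import unicodedata


def literal_escape(name):
    # Hypothetical helper: build the literal "\N{<name>}" and evaluate it
    # at run time. Returns None if the compiler rejects the escape.
    try:
        return eval(r'"\N{%s}"' % name)
    except SyntaxError:
        return None


for name in ("EURO SIGN", "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE"):
    looked_up = unicodedata.lookup(name)
    # lookup() may return more than one code point for a named sequence.
    print(name, [hex(ord(ch)) for ch in looked_up], repr(literal_escape(name)))
```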
Indeed
https://docs.python.org/3/library/unicodedata.html#unicodedata.lookup
mentions that fact:
Changed in version 3.3: Support for name aliases and named sequences
has been added.
https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
doesn't mention anything. It simply states
Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database
with the footnote "Changed in version 3.3: Support for name aliases has
been added.".
Which leads to the question:
Should \N{} be updated to support "Code point sequences"?
Furthermore it states: "Unlike Standard C, all unrecognized escape
sequences are left in the string unchanged", which could be interpreted
as meaning that "\N{BAD}" results in "\\N{BAD}".
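For comparison, here is a small sketch of what actually happens with an unrecognized escape versus an unknown \N{} name. eval_literal() is a hypothetical helper; warnings are silenced because newer Python versions warn about unrecognized escapes such as "\q", and that behaviour may become stricter in later releases.

```python
import warnings


def eval_literal(body):
    # Hypothetical helper: 'body' is the raw text between the quotes.
    # Returns the resulting string, or the SyntaxError if compilation fails.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        try:
            return eval('"' + body + '"')
        except SyntaxError as exc:
            return exc


# An unrecognized escape is left in the string unchanged.
print(repr(eval_literal(r"\q")))
# An unknown character name is rejected instead of being left unchanged.
print(repr(eval_literal(r"\N{BAD}")))
```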
but that checkletter() raises a SyntaxError. Look at the code ;-)
That would have helped. ;)
Victor
Servus,
Walter
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/RNZUBXZ3WGIQ57CONGFEVEPM4NFS5CWW/
Code of Conduct: http://python.org/psf/codeofconduct/