Terry J. Reedy <tjre...@udel.edu> added the comment:

I verified that the test file raises the quoted SyntaxError on 3.2 on Win7. 
This:

>>> "\N{LATIN CAPITAL LETTER GHA}"
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 0-27: unknown Unicode character name

is most likely a result of this:

>>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    unicodedata.lookup("LATIN CAPITAL LETTER GHA")
KeyError: "undefined character name 'LATIN CAPITAL LETTER GHA'"

Although the lookup comes first in nametests.py, it is never executed because 
of the later SyntaxError.
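
The ordering effect can be reproduced without the original file. This is a minimal sketch, assuming a two-line stand-in for nametests.py. The source string, the `log` list and the `run_source` helper are made up for illustration and are not part of the submitted test file.

```python
def run_source(source, namespace):
    """Compile the whole source first, then execute it, as happens for a file."""
    try:
        code = compile(source, "<nametests-standin>", "exec")
    except SyntaxError:
        # Compilation failed, so no statement in the source was executed.
        return "SyntaxError"
    exec(code, namespace)
    return "ran"


log = []
# The first line is valid on its own; the second holds an unknown \N{} name.
source = 'log.append("first line ran")\n"\\N{NO SUCH CHARACTER NAME}"\n'
result = run_source(source, {"log": log})
print(result, log)  # SyntaxError []
```

Because the bad escape is rejected while compiling, the earlier statement never gets a chance to run, which matches what happens to the lookup call in nametests.py.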

The Reference for string literals says:
"\N{name} Character named name in the Unicode database"

The doc for unicodedata says
"This module provides access to the Unicode Character Database (UCD) which 
defines character properties for all Unicode characters. The data contained in 
this database is compiled from the UCD version 6.0.0.

The module uses the same names and symbols as defined by Unicode Standard Annex 
#44, “Unicode Character Database”." 
http://www.unicode.org/reports/tr44/tr44-6.html

So the question is, what are the 'names' therein defined?
All such should be valid inputs to 
"unicodedata.lookup(name) Look up character by name."

The annex refers to http://www.unicode.org/Public/6.0.0/ucd/
This contains NamesList.txt, derived from UnicodeData.txt. Unicodedata must be 
using just the latter. The ucd directory also contains NameAliases.txt, 
NamedSequences.txt, and the file of provisional named sequences.

As best I can tell, the annex plus files are a bit ambiguous as to 'Unicode 
character name'. The following quote seems neutral: "the Unicode Character 
Database (UCD), a collection of data files which contain the Unicode character 
code points and character names." The following: "Unicode character names 
constitute a special case. Formally, they are values of the Name property." 
points toward UnicodeData.txt, which lists the Name property along with others. 
However, "Unicode character name, as published in the Unicode names list," 
indirectly points toward including aliases. NamesList.txt says it contains the 
"Final Unicode 6.0 names list." (but one which "should not be parsed for 
machine-readable information"). It includes all 11 aliases in NameAliases.txt. 

My current opinion is that adding the aliases might be done in current 
releases. It certainly would serve any user who does not know to misspell 
'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

Adding named sequences is definitely a feature request. The definition of 
.lookup(name) would be enlarged to "Look up character by name, alias, or named 
sequence" with reference to the specific files. The meaning of \N{} would also 
have to be enlarged.
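
The enlarged lookup could be sketched roughly as below. This is only an illustration of the proposed semantics, not a patch: `ALIASES`, `NAMED_SEQUENCES` and `extended_lookup` are hypothetical names, and the two table entries are just the examples from this message rather than data parsed from NameAliases.txt or NamedSequences.txt.

```python
import unicodedata

# Hypothetical side tables holding only the two examples from this message.
ALIASES = {"LATIN CAPITAL LETTER GHA": "\u01a2"}
NAMED_SEQUENCES = {"LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300"}


def extended_lookup(name):
    """Look up by name, then alias, then named sequence; else raise KeyError."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        pass
    for table in (ALIASES, NAMED_SEQUENCES):
        if name in table:
            return table[name]
    raise KeyError("undefined character name %r" % name)
```

A real implementation would presumably live inside unicodedata itself and build its tables from the UCD files when the database is generated.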

Minimal test code might be:

from unicodedata import lookup
assert lookup("LATIN CAPITAL LETTER GHA") == "\u01a2"
assert lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE") == "\u0100\u0300"
plus a test that "\N{LATIN CAPITAL LETTER GHA}" and
"\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" compile without error (I 
have no idea how to write that).
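
One possible way to write that last check is to build the literal as source text and hand it to the built-in compile(), treating SyntaxError as failure. This is a sketch under that assumption; `literal_compiles` is a hypothetical helper name, and whether the two names above pass depends on what the interpreter under test accepts in \N{}.

```python
def literal_compiles(name):
    """Return True if a string literal using the N-escape for name compiles."""
    # Build the source text "\N{name}" without writing the escape directly.
    source = '"\\N{' + name + '}"'
    try:
        compile(source, "<nametest>", "eval")
    except SyntaxError:
        return False
    return True


print(literal_compiles("LATIN SMALL LETTER A"))    # True
print(literal_compiles("NO SUCH CHARACTER NAME"))  # False
```

Keeping the escape inside a generated string means the test module itself always compiles, so an unsupported name shows up as a failed assertion instead of a SyntaxError for the whole file.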

---
> "If you look at the ICU UCharacter class, you can see that they provide a 
> more"

More what ;-)
I presume ICU = International Components for Unicode, icu-project.org/
"Offers a portable set of C/C++ and Java libraries for Unicode support, 
software internationalization (I18N) and globalization (G11N)."
[appears to be free, open source, and possibly usable within Python]

----------
nosy: +terry.reedy
stage: test needed -> needs patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________