unicodedata.name raises ValueError for a few Unicode characters such as '\0' or
'\n'. Although the documentation is very clear about this behavior, it is often
not what people want, i.e. a string describing the character.
Since Python 3.3, name aliases are accepted by unicodedata.lookup and in \N{...}
escapes, so unicodedata.lookup('NULL') works and '\N{NULL}' == '\N{NUL}'.
One could expect that lookup(name(x)) == x for every Unicode character, but this
property doesn't hold because of the few characters that do not have a name
(mainly control characters).
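
To make the gap concrete, here is a minimal sketch that checks the round trip
over the first 256 code points only (the range and the helper name has_name are
my choices for illustration, not part of the proposal):

```python
import unicodedata


def has_name(char):
    """Return True if unicodedata.name() knows a name for char."""
    try:
        unicodedata.name(char)
    except ValueError:  # raised when the character has no name
        return False
    return True


# Restrict the check to U+0000..U+00FF to keep the example small.
latin1 = [chr(cp) for cp in range(0x100)]
unnamed = [c for c in latin1 if not has_name(c)]
named = [c for c in latin1 if has_name(c)]

# The round trip holds for every character that has a name ...
round_trip_ok = all(unicodedata.lookup(unicodedata.name(c)) == c for c in named)

# ... but the unnamed ones (all control characters here) break it.
only_controls = all(unicodedata.category(c) == "Cc" for c in unnamed)

print(len(unnamed), round_trip_ok, only_controls)

# Today the only escape hatch is the existing "default" argument.
print(unicodedata.name("\n", "<no name>"))
```

In this range the unnamed characters are exactly the C0 and C1 controls, which
are also the ones that NameAliases.txt gives aliases to.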
Raising an exception is still useful, however, when the code point is unassigned
or was only added in a newer version of Unicode.
In NameAliases.txt
(https://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt) one can see that some
characters have multiple aliases, so there are multiple ways to map a character
to a name.
I propose adding a keyword argument to unicodedata.name that would select one of
a few useful fallback behaviors when the character has no name.
One simple behavior would be to choose the name from the "abbreviation" list.
Currently all characters that have an abbreviation, except three, have exactly
one, so that would be a good pick. I'd imagine
name('\x00', abbreviation=True) == 'NUL'.
The three characters in NameAliases.txt that have more than one abbreviation are:
'\n' with ['LF', 'NL', 'EOL']
'\t' with ['HT', 'TAB']
'\ufeff' with ['BOM', 'ZWNBSP']
In case multiple abbreviations exist, one could take the first one introduced
into Unicode (for backward compatibility across Python versions). If there is a
tie, one could take the first in the list. If the character has no name and no
abbreviation, unicodedata.name raises an error or returns the default as usual.
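
As a rough sketch of the behavior I have in mind, written as a wrapper rather
than as a change to unicodedata itself. The function name_or_abbreviation and
the table _ABBREVIATIONS are hypothetical; the table is copied by hand from the
examples in this mail instead of being parsed from NameAliases.txt, and the
tie-break is simplified to "first in the list":

```python
import unicodedata

# Hypothetical, hand-written excerpt of the "abbreviation" aliases.
# A real implementation would take these from NameAliases.txt.
_ABBREVIATIONS = {
    "\x00": ["NUL"],
    "\t": ["HT", "TAB"],
    "\n": ["LF", "NL", "EOL"],
    "\ufeff": ["BOM", "ZWNBSP"],
}

_MISSING = object()  # sentinel so that None can be passed as a default


def name_or_abbreviation(char, default=_MISSING):
    """Return the name of char, falling back to an abbreviation alias."""
    try:
        # The regular name always wins when it exists.
        return unicodedata.name(char)
    except ValueError:
        pass
    candidates = _ABBREVIATIONS.get(char)
    if candidates:
        # Simplified tie-break: first abbreviation in the list.
        return candidates[0]
    if default is _MISSING:
        # No name and no abbreviation: fail the same way name() does.
        raise ValueError("no such name")
    return default


print(name_or_abbreviation("\x00"))       # falls back to the abbreviation
print(name_or_abbreviation("a"))          # regular name, unchanged
print(name_or_abbreviation("\x01", "?"))  # not in the excerpt, uses default
```

Note that '\ufeff' already has a regular name, so the fallback never triggers
for it; only the control characters are affected.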
lookup(name(x)) == x for all x is natural, isn't it?
_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/