New submission from Shriramana Sharma:

Currently we have unicodedata.name(), which returns the formal name of the 
character chr as given in the second column of UnicodeData.txt from 
http://www.unicode.org/Public/UNIDATA/.

However, the formal names of a few characters contain spelling mistakes. Also, 
the control characters in the Basic Latin and Latin-1 blocks are not given 
meaningful character names. In one case, that of FEFF, the formal name ZERO 
WIDTH NO-BREAK SPACE refers to a deprecated usage of the character (and the 
alternate name BYTE ORDER MARK refers to the recommended usage).

In all these cases, improved names are provided as stable aliases in 
NameAliases.txt from the same UNIDATA source. These are also part of the stable 
standard and are intended to alleviate the naming situation w.r.t. the above 
issues. For the stability, see: 
http://www.unicode.org/policies/stability_policy.html#Formal_Name_Alias

Hence it would be most useful if the unicodedata module added an aliasedname() 
function with the same signature as name(). It would return the official alias 
for characters that have one, and the same output as name() for characters 
that do not.

As of Py 3.3, unicodedata.lookup() already uses/supports NameAliases.txt for 
returning the character given the name. The present requirement is to use it 
for returning the name given the character.
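
To make the asymmetry concrete, here is a small sketch using only the existing 
unicodedata API on Python 3.3 or later. The variable names are mine, chosen 
for illustration.

```python
import unicodedata

bom = "\ufeff"

# name() reports only the formal name from UnicodeData.txt.
formal = unicodedata.name(bom)

# lookup() accepts the formal name and, since 3.3, the alias as well.
via_formal = unicodedata.lookup("ZERO WIDTH NO-BREAK SPACE")
via_alias = unicodedata.lookup("BYTE ORDER MARK")

print(formal)
print(via_formal == bom, via_alias == bom)
```

So the alias can be used to find the character, but there is currently no way 
to get the alias back from the character.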

Note that NameAliases.txt has abbreviated names for some characters (where the 
third column reads "abbreviation"). While these are useful for lookup(), they 
would not be useful as return values of aliasedname(). For instance, one would 
prefer to see "SPACE" returned for 0020 rather than "SP". So these entries 
should be disregarded by aliasedname().

Also, NameAliases.txt has multiple entries for some characters even after 
discarding the abbreviation entries. In these cases, the first entry should be 
used (for want of a better rule). It is presumed that these are provided in 
some order of preference.
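
The two rules above (skip abbreviations, then take the first remaining entry) 
can be sketched in pure Python roughly as follows. This is only an 
illustration, not the attached file and not a proposed implementation: SAMPLE 
is a short inline excerpt in the NameAliases.txt format (code;alias;type) 
standing in for the real file, and load_preferred_aliases() and PREFERRED are 
hypothetical helper names.

```python
import unicodedata

# Illustrative excerpt in the NameAliases.txt format: code;alias;type.
# A real implementation would read the full file instead.
SAMPLE = """\
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
000A;LF;abbreviation
0020;SP;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
"""


def load_preferred_aliases(text):
    """Map each character to its first non-abbreviation alias."""
    preferred = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code, alias, kind = line.split(";")
        if kind == "abbreviation":
            continue  # abbreviations are never returned
        # setdefault keeps the first entry seen for each character.
        preferred.setdefault(chr(int(code, 16)), alias)
    return preferred


PREFERRED = load_preferred_aliases(SAMPLE)


def aliasedname(char, *default):
    """Return the preferred alias, falling back to unicodedata.name()."""
    if char in PREFERRED:
        return PREFERRED[char]
    # Passing *default through mirrors name(chr[, default]) exactly.
    return unicodedata.name(char, *default)


print(aliasedname("\n"))
print(aliasedname(" "))
print(aliasedname("\ufeff"))
```

With this excerpt, 0020 has only an abbreviation entry, so aliasedname() falls 
back to name() and returns "SPACE" as desired.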

It should be noted that discussion of this topic on the "unicore" (Unicode 
members) mailing list (in the thread "When normative aliases exist..." started 
2014-01-21) indicates that the order of entries is subject to change, although 
the entries themselves will not be removed. The first non-abbreviation entry 
may therefore change. This is acceptable for the behaviour of aliasedname(). 
Also note that new aliases may be defined in the future. Thus the string 
returned by aliasedname() for a given character is not guaranteed to stay the 
same across Unicode versions, but whatever it returns will always be valid to 
use with lookup(). Those who desire a single immutable name and do not require 
the improvements provided by the aliases should use name() and not 
aliasedname().

Finally, for extended support, a namealiases() function should return all the 
aliases of a character together with their types, letting the user choose 
freely among the official aliases.
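
One possible shape for namealiases() is sketched below. The return type (a 
list of (alias, type) tuples in file order, empty when there are no aliases) 
is my assumption and is open to discussion. As before, SAMPLE is an inline 
excerpt in the NameAliases.txt format and load_all_aliases() is a hypothetical 
helper.

```python
from collections import defaultdict

# Illustrative excerpt in the NameAliases.txt format: code;alias;type.
SAMPLE = """\
0020;SP;abbreviation
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
FEFF;ZWNBSP;abbreviation
"""


def load_all_aliases(text):
    """Map each character to all its (alias, type) pairs, in file order."""
    table = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code, alias, kind = line.split(";")
        table[chr(int(code, 16))].append((alias, kind))
    return dict(table)


ALIASES = load_all_aliases(SAMPLE)


def namealiases(char):
    """Return every alias of char with its type, abbreviations included."""
    return list(ALIASES.get(char, []))


for alias, kind in namealiases("\ufeff"):
    print(kind, alias)
```

Unlike aliasedname(), this keeps the abbreviation entries, since the caller 
can filter on the type.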

The attached code should clarify the required behaviour. (It is not a patch, 
just an illustration.)

----------
components: Unicode
files: aliasedname.py
messages: 209618
nosy: ezio.melotti, haypo, jamadagni
priority: normal
severity: normal
status: open
title: add aliasedname() and namedaliases() methods to unicodedata module
type: enhancement
versions: Python 3.3
Added file: http://bugs.python.org/file33788/aliasedname.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20433>
_______________________________________