Package: unicode
Version: 2.8-1.1
Severity: normal

The unicode tool fails to construct the systematic names of some characters whose entries are abbreviated as ranges in the UnicodeData.txt file.

It fails entirely for the Tangut and Tangut Supplement blocks:

$ unicode --brief 17a98
𗪘 U+17A98  - No such unicode character name in database

The name above should be "TANGUT IDEOGRAPH-17A98".
All properties other than the name are listed correctly.
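
For illustration only, here is a minimal sketch of how such a name could be derived from the abbreviated range entries: the name is a fixed prefix followed by the code point in uppercase hexadecimal. This is not the code of the unicode tool. The SAMPLE excerpt (first two fields only, with end points assumed from Unicode 14.0), the PREFIXES table and both function names are hypothetical.

```python
# Range entries in the style of UnicodeData.txt, first two fields only.
# The end points are assumptions based on Unicode 14.0.
SAMPLE = """\
17000;<Tangut Ideograph, First>
187F7;<Tangut Ideograph, Last>
18D00;<Tangut Ideograph Supplement, First>
18D08;<Tangut Ideograph Supplement, Last>
"""

# Hypothetical table: range label stem -> systematic name prefix.
PREFIXES = {"Tangut Ideograph": "TANGUT IDEOGRAPH-"}


def load_ranges(text):
    """Pair '<X, First>' and '<X, Last>' lines into (start, end, label) tuples."""
    ranges, start = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        code, name = line.split(";")[:2]
        if name.endswith(", First>"):
            start = int(code, 16)
        elif name.endswith(", Last>") and start is not None:
            label = name[1:-len(", Last>")]  # drop '<' and ', Last>'
            ranges.append((start, int(code, 16), label))
            start = None
    return ranges


def ideograph_name(cp, ranges):
    """Return the derived name for cp, or None if no known range covers it."""
    for start, end, label in ranges:
        if start <= cp <= end:  # both end points belong to the range
            for stem, prefix in PREFIXES.items():
                if label.startswith(stem):
                    return f"{prefix}{cp:04X}"
    return None


print(ideograph_name(0x17A98, load_ranges(SAMPLE)))  # TANGUT IDEOGRAPH-17A98
```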

Even in ranges where systematic names are derived correctly, unicode still displays the UnicodeData meta-label instead of the character name for the first and the last character:

$ unicode --brief  ac00 ac01 d7a3  3400 3401 4dbf
가 U+AC00 <Hangul Syllable, First>
각 U+AC01 HANGUL SYLLABLE GAG
힣 U+D7A3 <Hangul Syllable, Last>
㐀 U+3400 <CJK Ideograph Extension A, First>
㐁 U+3401 CJK UNIFIED IDEOGRAPH-3401
䶿 U+4DBF <CJK Ideograph Extension A, Last>

The missing names should be:
U+AC00 HANGUL SYLLABLE GA
U+D7A3 HANGUL SYLLABLE HIH
U+3400 CJK UNIFIED IDEOGRAPH-3400
U+4DBF CJK UNIFIED IDEOGRAPH-4DBF
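
As a sketch of why the end points need no special treatment, the names above follow from the same rules as the names of the characters inside the ranges. The Hangul part uses the syllable name generation rule of the Unicode Standard (decompose the syllable index into leading consonant, vowel and trailing consonant, then concatenate their short names). The function names are hypothetical, and the CJK Extension A bounds are simply the ones shown in the output above.

```python
S_BASE, S_COUNT = 0xAC00, 11172  # Hangul syllables: U+AC00..U+D7A3
V_COUNT, T_COUNT = 21, 28

# Jamo short names for leading consonants, vowels and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_name(cp):
    """Derive the name of a precomposed Hangul syllable."""
    index = cp - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a Hangul syllable")
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    return "HANGUL SYLLABLE " + JAMO_L[l] + JAMO_V[v] + JAMO_T[t]


def derived_name(cp):
    """Return a derived name for the two ranges discussed here, else None."""
    if S_BASE <= cp < S_BASE + S_COUNT:
        return hangul_name(cp)
    if 0x3400 <= cp <= 0x4DBF:  # CJK Extension A, end points included
        return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"
    return None


for cp in (0xAC00, 0xAC01, 0xD7A3, 0x3400, 0x4DBF):
    print(f"U+{cp:04X} {derived_name(cp)}")
# U+AC00 HANGUL SYLLABLE GA
# U+AC01 HANGUL SYLLABLE GAG
# U+D7A3 HANGUL SYLLABLE HIH
# U+3400 CJK UNIFIED IDEOGRAPH-3400
# U+4DBF CJK UNIFIED IDEOGRAPH-4DBF
```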

Leaving the UnicodeData meta-label in place might make some sense for control characters*, but it makes no sense for ranges whose systematic names are defined by generative rules and are abbreviated in UnicodeData only to save space.

* It would probably be better if controls and other code points with no name followed the convention described in Unicode §4.8 for code point labels, i.e. <control-0009> for U+0009 instead of just <control>, with the corresponding labels for reserved, noncharacter, private-use, and surrogate code points instead of just " - No such unicode character name in database". That would be a separate feature request, though, and I don't care about it enough to file one.
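
A rough sketch of that labelling convention, assuming Python's standard unicodedata module as the source of general categories. The function names are hypothetical, and which code points count as reserved depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# General category -> label type for code points that have no name.
LABEL_TYPES = {
    "Cc": "control",
    "Co": "private-use",
    "Cs": "surrogate",
    "Cn": "reserved",
}


def is_noncharacter(cp):
    """U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp):
    """Return a label such as <control-0009>, or None for named characters."""
    if is_noncharacter(cp):
        kind = "noncharacter"
    else:
        kind = LABEL_TYPES.get(unicodedata.category(chr(cp)))
    return f"<{kind}-{cp:04X}>" if kind else None


print(code_point_label(0x0009))  # <control-0009>
```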

-k


-- System Information:
Debian Release: bookworm/sid
  APT prefers testing
  APT policy: (900, 'testing'), (700, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.15.0-3-amd64 (SMP w/4 CPU threads)
Kernel taint flags: TAINT_WARN, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=pl_PL.UTF-8, LC_CTYPE=pl_PL.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages unicode depends on:
ii  python3  3.9.8-1

Versions of packages unicode recommends:
ii  unicode-data  14.0.0-1.1

Versions of packages unicode suggests:
ii  bzip2  1.0.8-5

-- no debconf information
