Re: character properties

Markus Kuhn Tue, 26 Sep 2000 02:28:37 -0700
Marcin 'Qrczak' Kowalczyk wrote on 2000-09-26 08:26 UTC:
> Sorry, I don't understand. There are indeed characters with category
> other than Lu or Lt but toLower c != c (roman numerals and circled
> capital letters). Are you saying that they should be isUpper? They
> are not considered letters! I think everybody assumes that if isUpper
> or isLower then isAlpha.

All this is rather academic anyway:

The Roman numerals in particular, and to some extent also the circled
letters, are in Unicode primarily for round-trip compatibility with
existing standards. You are not supposed to use them in practice, and
making toLower c = c and toUpper c = c for them is perfectly
acceptable (often even preferable!) for most applications.
Don't think only in terms of covering every esoteric detail of the
Unicode tables; instead, try to imagine what people want to use
these functions for in real applications. Also keep in mind that the
functions were originally copied from their libc equivalents, which were
designed primarily with ASCII in mind. They are not necessarily
particularly useful or meaningful for Unicode, so don't expect too much.
If people are really concerned about the more exotic aspects of the
Unicode tables, then they are most likely well familiar with the tables
and want only direct access to the category codes, not predicates
based on more dubious philosophical discussions of what belongs in
isSpace and what doesn't.
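
For anyone who wants to look at the cases Marcin mentions directly in
the tables, here is a minimal sketch. It uses Python's standard
unicodedata module rather than the library under discussion, and the
function name is made up for illustration. It lists code points whose
general category is neither Lu nor Lt but which still change under
lowercasing. The exact list depends on the Unicode version your Python
ships with.

```python
import sys
import unicodedata

def lowercasable_non_letters():
    """Return characters whose general category is neither Lu nor Lt
    but whose lowercase form still differs from the character itself."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Lu", "Lt"):
            continue  # ordinary uppercase/titlecase letters are not of interest here
        if ch.lower() != ch:
            found.append(ch)
    return found

if __name__ == "__main__":
    for ch in lowercasable_non_letters():
        # Print code point, category code, and character name.
        print(f"{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch, '?')}")
```

The Roman numerals show up with category Nl and the circled capital
letters with category So, which is why a predicate built only on the
letter categories does not see them.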

Titlecase is also something that will not matter in real applications.
There are only two groups of titlecase characters anyway. The first
consists of four Latin digraphs:

01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
01C8 LATIN CAPITAL LETTER L WITH SMALL LETTER J
01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J
01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z

The practical reasons for including these four characters still
escape me. Yes, I too have read the blurb about one-to-one character
transliteration in the standard, and I bet you that you will find
only their decomposed equivalents in > 98% of all Croatian Unicode
files.

The only other 27 titlecase characters all appear in the Greek
polytonic alphabet, which is also used only very infrequently today.
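
The counts above can be checked against the tables with a short
sketch, again using Python's unicodedata module as a stand-in for
direct access to the category codes. The helper names are hypothetical,
and grouping by the first word of the character name is only a
convenient approximation of "script".

```python
import sys
import unicodedata

def titlecase_letters():
    """Return all characters with general category Lt."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Lt"]

def group_by_script(chars):
    """Group characters by the first word of their Unicode name
    (for example LATIN or GREEK)."""
    groups = {}
    for ch in chars:
        script = unicodedata.name(ch).split()[0]
        groups.setdefault(script, []).append(ch)
    return groups

if __name__ == "__main__":
    for script, chars in group_by_script(titlecase_letters()).items():
        print(script, len(chars))
    # Compatibility normalization replaces the digraph U+01C5 with the
    # two-character sequence "D" + U+017E.
    print([f"{ord(c):04X}" for c in unicodedata.normalize("NFKC", "\u01c5")])
```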

Both applications are so rare and exotic that even a Serbian, Croatian
or Greek reader of a programming language reference manual would hardly
ever want to be bothered with them. My expected number of total users
of toTitle in your library over the next ten years is around 0.04.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/