Michael B. Allen wrote on 2002-04-01 19:56 UTC:
> If I want to count characters (rather than screen positions or bytes) I
> must know how to define a character.

For what purpose do you want to count characters? Most of the time,
people really are interested in counting either screen positions or
bytes. If you count characters, then you have to decide whether you want
to count only graphical base characters or all graphical characters or
all characters. You also have to decide whether to convert the string
into a normalization form before you start counting, since composed and
decomposed forms of the same text contain different numbers of
characters. All these questions are difficult to answer without knowing
why you want the count.
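
To illustrate how far these counts can diverge, here is a minimal Python
sketch. It is only an illustration under stated assumptions: the function
name count_variants is hypothetical, "graphical" is taken to mean the
general categories L, M, N, P, S and Zs, "base" means graphical but not a
combining mark (M), and NFC is picked as one possible normalization form.
The results depend on the Unicode version shipped with your Python.

```python
import unicodedata


def is_graphic(ch: str) -> bool:
    """Assumption: graphic = letters, marks, numbers, punctuation,
    symbols, and space separators (categories L*, M*, N*, P*, S*, Zs)."""
    cat = unicodedata.category(ch)
    return cat[0] in "LMNPS" or cat == "Zs"


def count_variants(s: str) -> dict:
    """Return several different 'lengths' of the same string."""
    nfc = unicodedata.normalize("NFC", s)
    return {
        # Bytes in the UTF-8 encoding.
        "utf8_bytes": len(s.encode("utf-8")),
        # Every code point, including control characters.
        "all_characters": len(s),
        # Graphical characters, combining marks included.
        "graphic_characters": sum(1 for ch in s if is_graphic(ch)),
        # Graphical characters that are not combining marks.
        "base_characters": sum(
            1 for ch in s
            if is_graphic(ch) and not unicodedata.category(ch).startswith("M")
        ),
        # Every code point after converting to NFC first.
        "all_characters_nfc": len(nfc),
    }


if __name__ == "__main__":
    # 'e' + COMBINING ACUTE ACCENT + 'a' + newline
    for name, value in count_variants("e\u0301a\n").items():
        print(f"{name}: {value}")
```

None of these numbers is the screen width; for that you would still use
something like wcwidth().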

If you want to count an arbitrary subset of Unicode characters based on
character range, character category, etc., then you can easily use the
"uniset" software that I used to generate the combining characters table
in wcwidth.c:

  http://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz

You'll find the documentation of the Unicode character categories in

  http://www.unicode.org/Public/UNIDATA/UnicodeData.html#General%20Category

and in more detail in the Unicode 3.0 book.
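
If you only need the count at run time rather than a generated table,
the same idea (select by range and general category) can be sketched in
Python. This is not uniset and does not reproduce its interface; the
function names category_histogram and count_in_subset and their
parameters are made up for illustration, and the category data comes
from Python's unicodedata module, which may correspond to a different
Unicode version than the files mentioned above.

```python
import unicodedata
from collections import Counter


def category_histogram(s: str) -> Counter:
    """Count the characters of s per general category (e.g. 'Lu', 'Mn')."""
    return Counter(unicodedata.category(ch) for ch in s)


def count_in_subset(s: str, first: int = 0x0000, last: int = 0x10FFFF,
                    categories=None) -> int:
    """Count characters of s whose code point lies in [first, last] and,
    if categories is given, whose general category is in that set."""
    return sum(
        1 for ch in s
        if first <= ord(ch) <= last
        and (categories is None or unicodedata.category(ch) in categories)
    )


if __name__ == "__main__":
    sample = "Ab1 \u0301\u0300"
    print(dict(category_histogram(sample)))
    # Non-spacing marks in the Combining Diacritical Marks block only.
    print(count_in_subset(sample, 0x0300, 0x036F, {"Mn"}))
```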

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
