Re: Counting "characters".

Michael B. Allen Tue, 02 Apr 2002 12:00:56 -0800

On Tue, 02 Apr 2002 13:24:25 +0100
Markus Kuhn <[EMAIL PROTECTED]> wrote:


> Michael B. Allen wrote on 2002-04-01 19:56 UTC:
> > If I want to count characters (rather than screen positions or bytes) I
> > must know how to define a character.
> 
> For what purpose do you want to count characters? Most of the time,
> people really are interested in counting either screen positions or
> bytes.

Man, I knew you were going to say that.

> If you count characters, then you have to decide whether you want
> to count only graphical base characters or all graphical characters or
> all characters.

This is basically what I'm asking about, but I want generic multi-byte
string functions. In my experience, modeling concepts is more successful
than modeling the physical world, so I am leaning toward not just counting
screen positions or bytes: screen positions are tied to the display on
which the string is presented, and bytes are tied to the underlying
serialized representation, which may differ depending on the encoding.
So I think there must be a separate set of functions to determine the
actual "width" of a string wrt a particular kind of display (e.g. a
terminal). One might actually have three sets of functions:

  mbssize  aka strlen determines size_t bytes
  mbswidth aka conv. to wchar_t and do wcswidth determines the
           number of screen positions
  mbslen   counts the number of characters where a "character" is
           something I still need to define.

The wcssize, wcswidth, and wcslen functions should produce equivalent
behavior.
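
A minimal sketch of how the byte count and two candidate "character"
counts can differ, assuming the multibyte encoding is well-formed UTF-8.
The names mbssize and mbslen follow the list above; mbslen_base and
utf8_decode are hypothetical helpers introduced only for illustration,
and the combining test covers just U+0300..U+036F rather than a full
combining-character table. mbswidth is left out because it depends on
the locale and on wcswidth.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes before the terminating NUL: the same as strlen. */
size_t mbssize(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Code points in a well-formed UTF-8 string: every byte that is
 * not a continuation byte (10xxxxxx) starts a new code point. */
size_t mbslen(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: decode one code point from well-formed UTF-8
 * and report how many bytes it occupied. No validation is done. */
static unsigned long utf8_decode(const unsigned char *s, size_t *len)
{
    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
             | (s[2] & 0x3FUL);
    }
    *len = 4;
    return ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
         | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3FUL);
}

/* Hypothetical variant: count code points but skip the Combining
 * Diacritical Marks block (U+0300..U+036F). A real implementation
 * would need a complete combining-character table. */
size_t mbslen_base(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    assert(s != NULL);
    while (*p != 0) {
        size_t len;
        unsigned long cp = utf8_decode(p, &len);
        if (cp < 0x0300 || cp > 0x036F)
            n++;
        p += len;
    }
    return n;
}
```

For "e" followed by U+0301 (bytes 65 CC 81) this gives 3 bytes, 2 code
points, and 1 base character, while the precomposed U+00E9 (bytes C3 A9)
gives 2, 1, and 1. So even in this small case the answer depends on
which definition of "character" is chosen.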

> You also have to decide whether you want to turn the
> string first into a normalization form before you start counting. All
> these questions are difficult to answer without knowing why you want to
> have the count.

Well, I didn't know about normalization, but this is *why* I want to develop
these functions. Not many people understand this stuff at this level. We
(the common program writers) need these more generic functions. Are you
suggesting each of us weave normalization and context-sensitive code
throughout our applications?

> If you want to count an arbitrary subset of Unicode characters based on
> character range, character category, etc., then you can easily use the
> "uniset" software that I used to generate the combining characters table
> in wcwidth.c:
> 
>   http://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz

Yes, I've seen this. I like the many interesting programs and code
snippets you have in the ucs directory.

> You'll find the documentation of the Unicode character categories in
> 
>   http://www.unicode.org/Public/UNIDATA/UnicodeData.html#General%20Category
> 
> and in more detail in the Unicode 3.0 book.

This is exactly the kind of thing I'm interested in. I will look at this
much more closely.

Thanks
Mike

-- 
May The Source be with you.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
