Re: Additional multi-byte functions

Pablo Saratxaga Sun, 13 May 2001 14:41:11 -0700
Kaixo!

On Thu, May 10, 2001 at 06:09:12PM +0100, Markus Kuhn wrote:

>> Btw, I think that a number of CJK display problems I keep seeing
>> are linked to the fact that strlen() doesn't work for non-8-bit encodings.
>>
>> There is wcslen() of course; but often the strings are not in wc, but in mb.
>> Is there an mbslen()? (that is, a function like wcslen, but applied to
> 
> Sort of:
> 
>   #define mbsrlen(s, ps) mbsrtowcs(NULL, &s, SIZE_MAX, ps)
>   #define mbslen(s)     mbsrtowcs(NULL, &s, SIZE_MAX, NULL)
 
> Note that these functions return (size_t)(-1) if they run into a
> malformed sequence, which I think is a big hassle in practice in
> languages without exception handling.

Hmm, do you think replacing strlen(a) with mbslen(a) would be OK,
or would it be better to write a small function to do it properly?
(the cases I think of shouldn't involve malformed sequences; if there
is a malformed sequence, the output will be broken anyway)
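
A "small function" along those lines could look like the sketch below. The name mbslen_sketch is hypothetical (it is not a standard function); it assumes the program has already called setlocale() so that LC_CTYPE matches the encoding of the string, and it passes the (size_t)(-1) error value straight through to the caller.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): number of characters in
 * the multibyte string s, decoded according to the current LC_CTYPE locale.
 * Returns (size_t)-1 if s contains an invalid sequence. */
size_t mbslen_sketch(const char *s)
{
    mbstate_t state;
    const char *p = s;   /* mbsrtowcs wants a pointer to the string pointer */

    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    /* With a NULL destination nothing is stored and the length argument is
     * ignored; the return value is the character count, not including the
     * terminating null. */
    return mbsrtowcs(NULL, &p, 0, &state);
}
```

A caller that only wants a best-effort answer could fall back to strlen(s) when the result is (size_t)(-1).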

> The length of a string matters for two applications:
> 
>   a) Find out how much memory to allocate. This requires a byte count,
>      and strlen does exactly what you want, even for multi-byte encodings.
> 
>   b) Find out, how many columns the cursor will advance if a string is sent
>      to a terminal. For wide strings, we have here wcswidth, but for
>      multibyte strings, there is no standardized convenient alternative.

In fact I'm thinking about the calculation of the width used for GUI labels.
What makes me suspect strlen() is that for CJK languages the window is
twice as wide as the text, with the right half blank. Since CJK text takes
2 bytes per char, that is a curious coincidence (I'm not sure it is the
reason; I only thought of the possibility today and haven't checked the
sources yet).
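
To make the factor of two concrete, here is a rough sketch that counts characters in an EUC-JP string by hand, without going through the locale. The helper name euc_chars_sketch is made up for illustration, and it assumes the string holds only ASCII and two-byte JIS X 0208 characters (no SS2/SS3 sequences, no validation of the second byte).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count characters in an EUC-JP string, assuming it
 * contains only ASCII and two-byte JIS X 0208 characters (no SS2/SS3
 * sequences, and no validation of trail bytes). */
size_t euc_chars_sketch(const char *s)
{
    size_t count = 0;

    while (*s != '\0') {
        unsigned char b = (unsigned char)*s;

        /* A lead byte in 0xA1..0xFE starts a two-byte character. */
        if (b >= 0xA1 && b <= 0xFE && s[1] != '\0')
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}
```

For the six bytes "\xC6\xFC\xCB\xDC\xB8\xEC" (three two-byte characters), strlen() reports 6 while this counter reports 3, so code that sizes a label as strlen() times the width of one glyph would come out twice too wide.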
 
> I don't think, you want to replace strlen with mbslen very frequently!

In fact the whole str* family shouldn't have been named like that, but rather
b* (as in byte); they don't deal with strings of text (that is, strings of
chars) but with strings of opaque and meaningless bytes. They are useful to
know the size of a string in memory, but not its size on display; it is too
bad that so many books make the ASCII assumption that text is a string
of chars with char=byte.

> The thing that I *REALLY* miss is the multi-byte version of wcwidth and
> wcswidth:
> 
>   mbwidth       column width of one multi-byte character
>   mbswidth      column width of a multi-byte string

Yes.
In fact, I have yet to see a single piece of user-facing text held in wc rather
than in mb. ISO-8859-1, KOI8-R, EUC-JP, UTF-8... all the widely used
encodings are mb; it is strange that those functions are not provided.
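
Until something like mbswidth is standardized, it can be approximated with mbrtowc() plus wcwidth(). The following is only a sketch: mbswidth_sketch is a hypothetical name, the choice to return -1 for both invalid sequences and non-printable characters is my own assumption, and the result depends on the current LC_CTYPE locale.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* wcwidth() is an XSI function */
#endif
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of terminal columns occupied by the multibyte
 * string s in the current LC_CTYPE locale. Returns -1 if s contains an
 * invalid sequence or a non-printable character. */
int mbswidth_sketch(const char *s)
{
    mbstate_t state;
    size_t left = strlen(s);   /* bytes still to decode */
    int total = 0;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &state);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;         /* invalid or truncated sequence */
        if (n == 0)
            break;             /* decoded a null character: stop */

        w = wcwidth(wc);
        if (w < 0)
            return -1;         /* non-printable character */

        total += w;
        s += n;
        left -= n;
    }
    return total;
}
```

Decoding one character at a time avoids allocating a temporary wide string, which is what the wcswidth() route would need.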

> Can mbwidth/mbswidth still be squeezed into the currently being
> finalized POSIX/SUS merger specification?

It would be nice if all the str* functions could have an mb* equivalent (I see mb as
the logical extension of ASCII).

> 
> Markus
> 
> -- 
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
> 
> -
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/lists/

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/            PGP Key available, key ID: 0x8F0E4975