Markus Kuhn wrote:
>Pablo Saratxaga wrote on 2001-05-10 15:50 UTC:
>> Btw, I think that a number of CJK displaying problems I keep seeing
>> are linked to the fact that strlen() doesn't work for non 8bit encodings.
>>
>> There is wcslen() of course; but often the strings are not in wc, but in mb.
>> converting mb<->wc adds complexity, and some programmers that don't worry
>> too much about i18n won't care about it.
>>
>> Is there an mbslen()? (that is, a function like wcslen, but applied to
>> a mb string; that does any necessary mb<->wc conversion internally).
>
>Sort of:
>
>  #define mbslen(s, ps) mbsrtowcs(NULL, &s, SIZE_MAX, ps)
>  #define mbslen(s)     mbsrtowcs(NULL, &s, SIZE_MAX, NULL)
>
>might do the job. (You can't use mbstowcs here unfortunately, because
>ISO C 99 doesn't specify that it can be used with pwcs==NULL. :-(( )
>
>Note that these functions return (size_t)(-1) if they run into a
>malformed sequence, which I think is a big hassle in practice in
>languages without exception handling.

Yes... I agree... this is almost useless in the "real world".  A useful
implementation would need a way to specify the behavior with respect to
illegal bytes (e.g. treat them as a 1-byte character, treat them as a 0-byte
character, or throw up hands and return -1 :-).
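
To make the three policies concrete, here is a rough sketch of a character
counter with a selectable policy.  The name mbslen_policy() and the enum
are made up for illustration; nothing like them exists in ISO C or POSIX.
It assumes the string is in the encoding of the current LC_CTYPE locale and
resynchronises one byte after each illegal byte, which is only one of
several possible recovery strategies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical policy names, invented for this sketch. */
enum mb_bad_policy {
    MB_BAD_FAIL,      /* throw up hands: return (size_t)-1 */
    MB_BAD_COUNT_ONE, /* treat each illegal byte as a 1-byte character */
    MB_BAD_SKIP       /* treat each illegal byte as a 0-byte character */
};

/* Count the characters in the multibyte string s (current LC_CTYPE
 * locale), applying the chosen policy to illegal bytes. */
size_t mbslen_policy(const char *s, enum mb_bad_policy policy)
{
    assert(s != NULL);

    mbstate_t st;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);              /* initial shift state */
    while (remaining > 0) {
        size_t n = mbrtowc(NULL, s, remaining, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* -1: illegal sequence, -2: string ends inside a character */
            if (policy == MB_BAD_FAIL)
                return (size_t)-1;
            if (policy == MB_BAD_COUNT_ONE)
                count++;
            memset(&st, 0, sizeof st);      /* state is unspecified after an error */
            s++;                            /* resynchronise one byte further on */
            remaining--;
            continue;
        }
        if (n == 0)                         /* decoded a null character: stop */
            break;

        count++;
        s += n;
        remaining -= n;
    }
    return count;
}
```

The caller would still have to call setlocale(LC_CTYPE, "") first, and the
right policy probably differs per application, which is rather the point.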

>The length of a string matters for two applications:
>
>  a) Find out how much memory to allocate. This requires a byte count,
>     and strlen does exactly what you want, even for multi-byte encodings.
>
>  b) Find out, how many columns the cursor will advance if a string is sent
>     to a terminal. For wide strings, we have here wcswidth, but for
>     multibyte strings, there is no standardized convenient alternative.
>
>I don't think, you want to replace strlen with mbslen very frequently!
>
>The thing that I *REALLY* miss is the multi-byte version of wcwidth and
>wcswidth:
>
>  mbwidth       column width of one multi-byte character
>  mbswidth      column width of a multi-byte string
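
For what it's worth, a multi-byte width function could be sketched on top
of mbrtowc() and the X/Open wcwidth().  The name mbswidth_sketch() is
hypothetical, and returning -1 on an illegal sequence or a non-printable
character is an assumption chosen to mirror how wcswidth() reports failure.

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical mbswidth(): number of columns needed to display s in the
 * current LC_CTYPE locale, or -1 if s contains an illegal sequence or a
 * character for which wcwidth() reports no width. */
int mbswidth_sketch(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    size_t remaining = strlen(s);
    int columns = 0;

    memset(&st, 0, sizeof st);              /* initial shift state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                      /* illegal or truncated sequence */
        if (n == 0)
            break;                          /* decoded a null character */

        int w = wcwidth(wc);
        if (w < 0)
            return -1;                      /* non-printable character */

        columns += w;
        s += n;
        remaining -= n;
    }
    return columns;
}
```

This has the same weakness as the mbslen macro: one bad byte and the
caller gets -1 with no hint of how far the cursor would actually move.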

Yes... I get your point, but in fact, if you have the mbsnth() call that I
mentioned in my earlier reply, you can actually use these mb variants to write
code that "looks" very much like traditional ASCII string handling, which will
work for ASCII, UTF-8, or any multi-byte encoding, and doesn't require the
programmer to be aware of the i18n details.  This is an attractive (if in fact
difficult or impossible to truly achieve) idea.
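
As a rough illustration of the nth-character call, here is one way it might
look.  The name mbs_nth_sketch() and its exact semantics (0-based index,
NULL when the string is too short or malformed) are assumptions made for
this sketch, not a definition of any existing interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Assumed semantics: return a pointer to the start of character number n
 * (0-based) in the multibyte string s.  If n equals the number of
 * characters, the result points at the terminating NUL.  Returns NULL if
 * s has fewer than n characters or contains an illegal sequence. */
const char *mbs_nth_sketch(const char *s, size_t n)
{
    assert(s != NULL);

    mbstate_t st;
    size_t remaining = strlen(s);

    memset(&st, 0, sizeof st);              /* initial shift state */
    while (n > 0 && remaining > 0) {
        size_t len = mbrtowc(NULL, s, remaining, &st);

        if (len == (size_t)-1 || len == (size_t)-2 || len == 0)
            return NULL;                    /* illegal, truncated, or null */

        s += len;                           /* step over one whole character */
        remaining -= len;
        n--;
    }
    return n == 0 ? s : NULL;
}
```

With that, indexing by character reads almost like s + n does for ASCII,
though every call rescans from the start, and in a stateful encoding the
returned pointer is not meaningful without the shift state that goes with it.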

-steve

Steve Swales
Sun Microsystems, Inc.
901 San Antonio Road, MS MPK16-201
Palo Alto, CA 94303-4900
650 786-0612 Direct
650 786-9553 Fax
[EMAIL PROTECTED]

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
