Markus Kuhn wrote:
>Pablo Saratxaga wrote on 2001-05-10 15:50 UTC:
>> Btw, I think that a number of the CJK display problems I keep seeing
>> are linked to the fact that strlen() doesn't work for non-8-bit encodings.
>>
>> There is wcslen() of course, but often the strings are not in wc, but in mb.
>> Converting mb<->wc adds complexity, and some programmers who don't worry
>> too much about i18n won't bother with it.
>>
>> Is there an mbslen()? (that is, a function like wcslen, but applied to
>> a mb string; that does any necessary mb<->wc conversion internally).
>
>Sort of:
>
> #define mbslen(s, ps) mbsrtowcs(NULL, &s, SIZE_MAX, ps)
>or
> #define mbslen(s) mbsrtowcs(NULL, &s, SIZE_MAX, NULL)
>
>might do the job. (You can't use mbstowcs here unfortunately, because
>ISO C 99 doesn't specify that it can be used with pwcs==NULL. :-(( )
>
>Note that these functions return (size_t)(-1) if they run into a
>malformed sequence, which I think is a big hassle in practice in
>languages without exception handling.

Yes... I agree... this is almost useless in the "real world". A useful
implementation would need a way to specify the behavior with respect to
illegal bytes (e.g. treat them as a 1-byte character, treat them as a 0-byte
character, or throw up hands and return -1 :-).

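To make the three behaviors concrete, here is a rough sketch of a counting
function with a selectable policy. The names mbslen_policy and the MB_BAD_*
constants are made up for illustration and are not part of any standard; the
sketch assumes the current LC_CTYPE locale and resynchronises one byte at a
time after an illegal byte, which is only one of several possible choices.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical policy names; not part of ISO C or POSIX. */
enum mb_bad_policy {
    MB_BAD_FAIL,     /* throw up hands and return (size_t)-1 */
    MB_BAD_AS_CHAR,  /* count each illegal byte as one character */
    MB_BAD_SKIP      /* count each illegal byte as zero characters */
};

/* Count the characters in the multibyte string s under the current
   LC_CTYPE locale, applying the given policy to illegal bytes. */
size_t mbslen_policy(const char *s, enum mb_bad_policy policy)
{
    mbstate_t st;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (remaining > 0) {
        size_t n = mbrtowc(NULL, s, remaining, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* -1: illegal sequence; -2: string ends mid-character */
            if (policy == MB_BAD_FAIL)
                return (size_t)-1;
            if (policy == MB_BAD_AS_CHAR)
                count++;
            memset(&st, 0, sizeof st);  /* state is unspecified after an error */
            s++;                        /* retry one byte further on */
            remaining--;
            continue;
        }
        if (n == 0)                     /* null character; should not occur here */
            break;
        count++;
        s += n;
        remaining -= n;
    }
    return count;
}
```

A caller would pick the policy per use, e.g. mbslen_policy(str, MB_BAD_AS_CHAR)
when it would rather get an approximate count than an error.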
>The length of a string matters for two applications:
>
> a) Find out how much memory to allocate. This requires a byte count,
> and strlen does exactly what you want, even for multi-byte encodings.
>
> b) Find out how many columns the cursor will advance if a string is sent
> to a terminal. For wide strings we have wcswidth here, but for
> multibyte strings there is no standardized convenient alternative.
>
>I don't think you want to replace strlen with mbslen very frequently!
>
>The thing that I *REALLY* miss is the multi-byte version of wcwidth and
>wcswidth:
>
> mbwidth column width of one multi-byte character
> mbswidth column width of a multi-byte string

Yes... I get your point, but in fact, if you have the bmsnth() call that I
mentioned in my earlier reply, you can actually use these mb variants to write
code that "looks" very much like traditional ASCII string handling, which will
work for ASCII, UTF-8, or any multi-byte encoding, and doesn't require the
programmer to be aware of the i18n details. This is an attractive (if in fact
difficult or impossible to truly achieve) idea.

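For what it's worth, the missing mbswidth can be approximated today on top of
mbrtowc() and wcwidth(). The following is only a sketch: the name
mbswidth_sketch is hypothetical, it works in the current LC_CTYPE locale, and
it assumes that returning -1 for an illegal sequence or a non-printable
character (much as wcswidth does for the latter) is acceptable.

```c
#define _XOPEN_SOURCE 700   /* for wcwidth() */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical mbswidth: column width of the multibyte string s in the
   current LC_CTYPE locale, or -1 if s contains an illegal sequence or a
   non-printable character. */
int mbswidth_sketch(const char *s)
{
    mbstate_t st;
    size_t remaining = strlen(s);
    int width = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                  /* illegal or truncated sequence */
        if (n == 0)
            break;                      /* null character; should not occur here */
        w = wcwidth(wc);
        if (w < 0)
            return -1;                  /* non-printable character */
        width += w;
        s += n;
        remaining -= n;
    }
    return width;
}
```

This does the mb->wc conversion internally, so the caller never sees a wchar_t,
which is the whole point of having an mb variant.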
-steve
Steve Swales
Sun Microsystems, Inc.
901 San Antonio Road, MS MPK16-201
Palo Alto, CA 94303-4900
650 786-0612 Direct
650 786-9553 Fax
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/