On Wed, Aug 13, 2014 at 7:06 PM, Harald Becker <[email protected]> wrote:
>> The world seems to be standardizing on utf-8.
>>
>> Thank God, supporting a gazillion encodings is no fun.
>
>
> You say this, but libbb/unicode.c contains a unicode_strlen that calls a
> complex multibyte to wide-character conversion function just to count the
> number of characters. Those multibyte functions tend to be highly complex
> and slow (I don't know if they have gotten better). For UTF-8 alone, things
> can be optimized.
bbox does have a Unicode-only implementation of mbstowc.
See unicode.c.
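> size_t utf8len( const char* s )
> {
>     size_t n = 0;
>     /* Count every byte except UTF-8 continuation bytes (0x80..0xBF). */
>     while (*s)
>         /* Cast needed: where char is signed, bytes >= 0x80 are negative. */
>         if (((unsigned char)*s++ ^ 0x40) < 0xC0)
>             n++;
>     return n;
> }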
>
> size_t mystrlen( const char* s )
> {
>     return utf8_enabled ? utf8len(s) : strlen(s);
> }
>
> This looks like more code, but it avoids pulling in the mb functions. Most
> compilers should produce fast code for utf8len.
There are situations where you need to do tons of unicode_strlen() calls
and can tolerate getting wrong results on broken Unicode.
Then a function similar to yours can be very useful.
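
To make the "wrong results on broken Unicode" part concrete, here is a minimal sketch of such a counter. The name utf8_count is made up for illustration and is not something in libbb; it assumes the input is a NUL-terminated string and does no validation at all. It uses the mask form of the same test: a byte is counted unless it looks like a continuation byte (10xxxxxx).

```c
#include <stddef.h>

/* Hypothetical helper, not BusyBox code: count UTF-8 code points by
 * counting every byte that is NOT a continuation byte (10xxxxxx).
 * No validation is performed on the input. */
size_t utf8_count(const char *s)
{
	/* Read bytes as unsigned so values >= 0x80 are never negative. */
	const unsigned char *p = (const unsigned char *)s;
	size_t n = 0;

	while (*p) {
		if ((*p & 0xC0) != 0x80)  /* ASCII or lead byte */
			n++;
		p++;
	}
	return n;
}
```

On valid UTF-8 this returns the number of code points. On broken input it quietly miscounts: stray continuation bytes are not counted at all, and a truncated sequence (a lead byte with nothing after it) still counts as one character.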
--
vda