Re: Possible Unicode Problems in Busybox - Collect and Discussion

Denys Vlasenko Fri, 15 Aug 2014 04:01:33 -0700

On Wed, Aug 13, 2014 at 7:06 PM, Harald Becker <[email protected]> wrote:
>> The world seems to be standardizing on utf-8.
>>
>> Thank God, supporting a gazillion encodings is no fun.
>
>
> You say this, but libbb/unicode.c contains a unicode_strlen that calls the
> complex mb-to-wc conversion function just to count characters.
> Those multibyte functions tend to be complex and slow (I don't know whether
> they have improved). For UTF-8 alone, things can be optimized.


bbox does have its own Unicode-only implementation of mbstowcs().
See libbb/unicode.c.

> size_t utf8len( const char* s )
> {
>   size_t n = 0;
>   while (*s)
>     if ((*s++ ^ 0x40) < 0xC0)
>       n++;
>   return n;
> }
>
> size_t mystrlen( const char* s )
> {
>   return utf8_enabled ? utf8len(s) : strlen(s);
> }
>
> This looks like more code, but it avoids pulling in the mb functions. Most
> compilers should produce fast code for utf8len.

There are situations where you need to do tons of unicode_strlen()
and you can tolerate getting wrong results on broken Unicode.
Then a function similar to yours can be very useful.
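
A minimal sketch of what such a function could look like. The name
utf8len_sketch is hypothetical and this is not code from libbb. One caveat
about the quoted version: the test (*s++ ^ 0x40) < 0xC0 works only where
plain char is unsigned. Where char is signed, bytes >= 0x80 are promoted to
negative ints, the comparison is always true, and every byte gets counted.
The sketch casts to unsigned char first and counts every byte that is not a
continuation byte (10xxxxxx):

```c
#include <stddef.h>

/* Hypothetical helper, not BusyBox code: count UTF-8 code points by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * It does no validation, so malformed input gives a plausible-looking
 * but possibly wrong count. */
static size_t utf8len_sketch(const char *s)
{
	size_t n = 0;

	while (*s) {
		/* Cast before testing: plain char may be signed, which
		 * would make bytes >= 0x80 negative after promotion. */
		unsigned char c = (unsigned char)*s++;

		if ((c & 0xC0) != 0x80)
			n++;
	}
	return n;
}
```

For valid UTF-8 this returns the number of code points, for example 5 for
"na" "\xc3\xaf" "ve" (6 bytes). On broken input it silently miscounts: a
stray continuation byte such as "\x80" adds nothing, and a truncated lead
byte such as "\xc3" still counts as one character.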

-- 
vda
_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox
