Pádraig Brady <[email protected]> writes:
> On 18/10/2025 22:05, Collin Funk wrote:
>> Pádraig Brady <[email protected]> writes:
>>
>>> There were various other multi-byte blanks issues,
>>> and multi-byte issues in general when I looked further.
>>>
>>> The attached 3 further patches should make numfmt fully support multi-byte.
>> numfmt is a nice case where we don't need to optimize MB_CUR_MAX == 1,
>> thanks.
>
> Right. But that got me thinking that we could optimize
> in various cases, rather than resorting to mbsstr().
> The attached implements mbsmbchr(mbs, mbc) to more efficiently
> search for a multi-byte char in a multi-byte string,
> especially with the usual UTF-8 charset
> (which is determined with a single mbrtoc32() call per process).

I wonder if that function is worth putting in gl/ under LGPL in case we
want to use it in other programs and/or move it to Gnulib. It seems
useful to me.
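
To illustrate why UTF-8 allows a cheaper search, here is a minimal sketch. It is not the mbsmbchr() from the patch: the name utf8_find_char is hypothetical, and it assumes the caller already knows the locale is UTF-8 and passes MBC as exactly one complete, valid character. Under those assumptions a plain byte search cannot match in the middle of another character, because UTF-8 lead bytes and continuation bytes never overlap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Return a pointer to the first occurrence in S of the single
   character MBC, or NULL if there is none.
   Assumes a UTF-8 locale and that MBC holds exactly one complete,
   valid, NUL-terminated character.  */
static char const *
utf8_find_char (char const *s, char const *mbc)
{
  /* A single-byte (ASCII) character never appears inside a
     multi-byte UTF-8 sequence, so strchr is enough.  */
  if (mbc[0] != '\0' && mbc[1] == '\0')
    return strchr (s, mbc[0]);

  /* UTF-8 is self-synchronizing, so a byte-wise match of a complete
     character always starts on a character boundary.  */
  return strstr (s, mbc);
}
```

Other multi-byte encodings do not have this property, so they would still need a character-aware search such as mbsstr().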

> + mbstate_t mbstate = {0,};

The following is slightly more efficient:

    mbstate_t mbstate; mbszero (&mbstate);

> + is_utf8 = mbrtoc32 (&w, "\xe2\x9f\xb8", 3, &mbstate) == 3 && w == 0x27F8;

You might want to copy the test from lib/quotearg.c instead, for
consistency:

    /* snipped text...
       If the current encoding is consistent with UTF-8 for U+2018,
       assume that the locale uses UTF-8.  This is safe in practice,
       and means we need not use a function like locale_charset that
       has other dependencies.  */
    static char const quote[][4] = { "\xe2\x80\x98", "\xe2\x80\x99" };
    char32_t w;
    mbstate_t mbs; mbszero (&mbs);
    if (mbrtoc32 (&w, quote[0], 3, &mbs) == 3 && w == 0x2018)
      return quote[msgid[0] == '\''];
Collin