On 19/10/2025 22:17, Collin Funk wrote:
Pádraig Brady writes:
On 18/10/2025 22:05, Collin Funk wrote:
Pádraig Brady writes:
There were various other multi-byte blanks issues,
and multi-byte issues in general when I looked further.
The attached 3 further patches should make numfmt fully support multi-byte.
numfmt is a nice case where we don't need to optimize MB_CUR_MAX == 1,
thanks.
Right. But that got me thinking that we could optimize
in various cases, rather than resorting to mbsstr().
The attached implements mbsmbchr(mbs, mbc) to more efficiently
search for a multi-byte char in a multi-byte string,
especially with the usual UTF-8 charset
(which is determined with a single call to mbrtoc32() per process).
I wonder if that function is worth putting in gl/ under LGPL in case we
want to use it in other programs and/or move it to Gnulib. It seems
useful to me.
Yes probably.
I was going to look at maybe using it in cut(1) too,
in which case it would definitely be appropriate to move to gl/
+ mbstate_t mbstate = {0,};
The following is slightly more efficient:
mbstate_t mbstate; mbszero (&mbstate);
ack
+ is_utf8 = mbrtoc32 (&w, "\xe2\x9f\xb8", 3, &mbstate) == 3 && w == 0x27F8;
You might want to copy the test from lib/quotearg.c instead, for
consistency:
/* snipped text...
If the current encoding is consistent with UTF-8 for U+2018,
assume that the locale uses UTF-8. This is safe in practice,
and means we need not use a function like locale_charset that
has other dependencies. */
static char const quote[][4] = { "\xe2\x80\x98", "\xe2\x80\x99" };
char32_t w;
mbstate_t mbs; mbszero (&mbs);
if (mbrtoc32 (&w, quote[0], 3, &mbs) == 3 && w == 0x2018)
return quote[msgid[0] == '\''];
That's a bit more verbose, also 0x2018 looks less like UTF8 than 0x27F8 ;)
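
For anyone checking the constants by hand, both probes are ordinary
3-byte UTF-8 sequences. A minimal, locale-independent sketch of the
arithmetic follows; the helper name utf8_decode3 is made up for
illustration, and it does no validation of the lead or continuation
bytes:

```c
#include <stdint.h>

/* Hypothetical helper: decode one 3-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) into its code point.
   The input is assumed to be well formed; nothing is validated.  */
static uint32_t
utf8_decode3 (unsigned char const *b)
{
  return ((uint32_t) (b[0] & 0x0F) << 12)   /* top 4 bits    */
         | ((uint32_t) (b[1] & 0x3F) << 6)  /* middle 6 bits */
         | (uint32_t) (b[2] & 0x3F);        /* low 6 bits    */
}
```

With this, E2 9F B8 gives 0x2000 + 0x7C0 + 0x38 = 0x27F8, and E2 80 98
gives 0x2000 + 0x000 + 0x18 = 0x2018, matching the values that the two
mbrtoc32() tests compare against.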
cheers,
Padraig