On 19/10/2025 22:17, Collin Funk wrote:
Pádraig Brady <[email protected]> writes:

On 18/10/2025 22:05, Collin Funk wrote:
Pádraig Brady <[email protected]> writes:

There were various other multi-byte blanks issues,
and multi-byte issues in general when I looked further.

The attached 3 further patches should make numfmt fully support multi-byte.
numfmt is a nice case where we don't need to optimize MB_CUR_MAX == 1,
thanks.

Right. But that got me thinking that we could optimize
in various cases, rather than resorting to mbsstr().
The attached implements mbsmbchr(mbs, mbc) to more efficiently
search for a multi-byte char in a multi-byte string,
especially with the usual UTF-8 charset
(which is determined with a single mbrtoc32() call per process).
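
To illustrate the idea above, here is a minimal sketch, not the attached
patch. The name mbsmbchr_sketch and the is_utf8 parameter are
hypothetical. It assumes that in UTF-8 a plain byte search is enough,
because lead bytes and continuation bytes never overlap, so a valid
character cannot match in the middle of another one. For other
encodings it falls back to stepping one character at a time with
mbrlen().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical sketch, not the coreutils implementation.
   Return a pointer to the first occurrence of the single multi-byte
   character MBC (a NUL-terminated byte sequence) in the multi-byte
   string MBS, or NULL if it does not occur.  IS_UTF8 says whether
   the locale encoding is UTF-8.  */
static char *
mbsmbchr_sketch (char const *mbs, char const *mbc, bool is_utf8)
{
  assert (mbs && mbc);

  size_t mbc_len = strlen (mbc);
  if (mbc_len == 0)
    return NULL;

  /* UTF-8 is self-synchronizing, so a byte-wise search cannot
     produce a match that starts inside another character.  */
  if (is_utf8)
    return (char *) (mbc_len == 1 ? strchr (mbs, mbc[0])
                                  : strstr (mbs, mbc));

  /* Other encodings: walk the string one character at a time, so a
     trailing byte of one character is never mistaken for MBC.  */
  char const *end = mbs + strlen (mbs);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  for (char const *p = mbs; p < end; )
    {
      size_t n = mbrlen (p, end - p, &state);
      if (n == (size_t) -1 || n == (size_t) -2 || n == 0)
        {
          /* Invalid or incomplete sequence: treat it as one byte.  */
          n = 1;
          memset (&state, 0, sizeof state);
        }
      if (n == mbc_len && memcmp (p, mbc, n) == 0)
        return (char *) p;
      p += n;
    }
  return NULL;
}
```

The caller would compute is_utf8 once and pass it in, so the common
UTF-8 case costs no more than strchr() or strstr().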

I wonder if that function is worth putting in gl/ under LGPL in case we
want to use it in other programs and/or move it to Gnulib. It seems
useful to me.

Yes probably.
I was going to look at maybe using it in cut(1) too,
in which case it would definitely be appropriate to move it to gl/.

+      mbstate_t mbstate = {0,};

The following is slightly more efficient:

     mbstate_t mbstate; mbszero (&mbstate);

ack

+      is_utf8 = mbrtoc32 (&w, "\xe2\x9f\xb8", 3, &mbstate) == 3 && w == 0x27F8;

You might want to copy the test from lib/quotearg.c instead, for
consistency:

    /* snipped text...
      If the current encoding is consistent with UTF-8 for U+2018,
      assume that the locale uses UTF-8.  This is safe in practice,
      and means we need not use a function like locale_charset that
      has other dependencies.  */
   static char const quote[][4] = { "\xe2\x80\x98", "\xe2\x80\x99" };
   char32_t w;
   mbstate_t mbs; mbszero (&mbs);
   if (mbrtoc32 (&w, quote[0], 3, &mbs) == 3 && w == 0x2018)
     return quote[msgid[0] == '\''];

That's a bit more verbose, also 0x2018 looks less like UTF8 than 0x27F8 ;)
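
For reference, the one-shot probe discussed above could be wrapped
roughly as follows. This is only a sketch: the function name
locale_is_utf8_sketch is hypothetical, memset() stands in for Gnulib's
mbszero() so the snippet builds without Gnulib, and the cached result
assumes the locale does not change after the first call.

```c
#include <stdbool.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical sketch.  Return true if the current LC_CTYPE locale
   decodes the UTF-8 bytes of U+27F8 as that code point.  The probe
   runs at most once per process; later calls reuse the result.  */
static bool
locale_is_utf8_sketch (void)
{
  static int cached = -1;   /* -1: not probed yet, else 0 or 1.  */
  if (cached < 0)
    {
      char32_t w;
      mbstate_t mbstate;
      memset (&mbstate, 0, sizeof mbstate);
      cached = (mbrtoc32 (&w, "\xe2\x9f\xb8", 3, &mbstate) == 3
                && w == 0x27F8);
    }
  return cached;
}
```

A program would call setlocale (LC_ALL, "") before the first call;
without that, the process is still in the "C" locale and the probe is
expected to report false.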

cheers,
Padraig
