is_utf8_charset and multi-byte cut

Collin Funk Sun, 26 Oct 2025 22:04:40 -0700

Hi Pádraig,

Pádraig Brady <[email protected]> writes:


>>> Right. But that got me thinking that we could optimize
>>> in various cases, rather than resorting to mbsstr().
>>> The attached implements mbsmbchr(mbs, mbc) to more efficiently
>>> search for a multi-byte char in a multi-byte string,
>>> especially with the usual UTF-8 charset
>>> (which is determined with a single mbrtoc32() call per process).
>> I wonder if that function is worth putting in gl/ under LGPL in case
>> we want to use it in other programs and/or move it to Gnulib. It
>> seems useful to me.
>
> Yes probably.
> I was going to look at maybe using it in cut(1) too,
> in which case it would definitely be appropriate to move to gl/

I was thinking about some i18n stuff today. A prerequisite to cut(1) is
getndelim2, which is probably the part that requires the most work.

For cut_fields, which uses getndelim2, I am thinking of a new function.
Something like this, which uses is_utf8_charset() from numfmt:

    ssize_t
    mb_getndelim2 (char **lineptr, size_t *linesize, size_t offset,
                   size_t nmax, mcel_t delim1, mcel_t delim2,
                   FILE *stream)
    {
      if (MB_CUR_MAX == 1
          || (delim1.ch < 0x30 && delim2.ch < 0x30)
          || (is_utf8_charset ()
              && delim1.ch < 0x80 && delim2.ch < 0x80))
        return getndelim2 (lineptr, linesize, offset, nmax, delim1.ch,
                           delim2.ch, stream);
      mbbuf_t mbbuf;
      char buffer[BUFSIZ];
      mbbuf_init (&mbbuf, buffer, sizeof buffer, stream);
      /* Read from the file using mbbuf_get_char until reaching a
         delimiter, allocating and copying into LINEPTR as needed.  */
      return bytes_read;
    }

That would allow us to avoid many mbrtowc calls when using LC_ALL=C (*)
or when using a UTF-8 locale with ASCII delimiters.
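
To make the fast-path condition easier to reason about, here is a
minimal standalone sketch of just that test. The helper name
byte_delims_ok and its parameters are hypothetical, not anything in
coreutils or Gnulib: the locale properties are passed in as arguments
so the logic can be checked without setting a locale, and uint32_t
stands in for char32_t. It also takes the 0x30 threshold from the
sketch above on trust, i.e. it assumes bytes below 0x30 never occur
inside a multi-byte character in any supported encoding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return true if both delimiters can be matched
   byte-by-byte, so plain getndelim2 would be safe to call.
   MB_CUR_MAX_VAL stands in for MB_CUR_MAX and UTF8 for the result of
   is_utf8_charset ().  */
static bool
byte_delims_ok (size_t mb_cur_max_val, bool utf8,
                uint32_t delim1, uint32_t delim2)
{
  /* Unibyte locale: every character is a single byte.  */
  if (mb_cur_max_val == 1)
    return true;

  /* Assumption: bytes below 0x30 are never part of a multi-byte
     sequence, whatever the charset.  */
  if (delim1 < 0x30 && delim2 < 0x30)
    return true;

  /* In UTF-8, bytes below 0x80 only ever encode ASCII characters.  */
  return utf8 && delim1 < 0x80 && delim2 < 0x80;
}
```

So with ':' and '\n' as delimiters the fast path would be taken in a
UTF-8 locale, but not in some other multi-byte locale, since ':' is
0x3A.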

WDYT?

Collin

(*) Ignoring systems where the C locale is UTF-8.
