On 27/10/2025 05:03, Collin Funk wrote:
Hi Pádraig,
Pádraig Brady <[email protected]> writes:
Right. But that got me thinking that we could optimize
in various cases, rather than resorting to mbsstr().
The attached implements mbsmbchr(mbs, mbc) to more efficiently
search for a multi-byte char in a multi-byte string,
especially with the usual UTF-8 charset
(which is determined with a single mbrtoc32() call per process).
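For reference, the fast paths can be sketched roughly like this (a minimal
sketch, not the attached patch; the signature and details here are
illustrative, with mbsstr() being gnulib's multi-byte-aware strstr):

#include <stdbool.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>

/* gnulib's multi-byte-aware strstr, used as the general fallback.  */
extern char *mbsstr (const char *haystack, const char *needle);

/* Find the first occurrence in the multi-byte string MBS of the
   character whose encoding is the NUL-terminated byte sequence MBC.
   UTF8 says whether the locale charset is UTF-8.  */
static char *
mbsmbchr (const char *mbs, const char *mbc, bool utf8)
{
  /* Single-byte needle: strchr suffices in a unibyte locale, and in
     UTF-8 for ASCII bytes, which never occur inside another character.  */
  if (!mbc[1] && (MB_CUR_MAX == 1 || (utf8 && (unsigned char) *mbc < 0x80)))
    return strchr (mbs, *mbc);

  /* UTF-8 is self-synchronizing: one character's encoding never appears
     inside another's, so a byte-wise substring search cannot match at a
     misaligned position.  */
  if (utf8)
    return strstr (mbs, mbc);

  /* Other multi-byte encodings need a character-aware search.  */
  return mbsstr (mbs, mbc);
}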
I wonder if that function is worth putting in gl/ under LGPL in case we
want to use it in other programs and/or move it to Gnulib. It seems
useful to me.
Yes probably.
I was going to look at maybe using it in cut(1) too,
in which case it would definitely be appropriate to move it to gl/.
I was thinking about some i18n stuff today. A prerequisite for cut(1) is
getndelim2, which is probably the part that requires the most work.
For cut_fields, which uses getndelim2, I am thinking of a new function.
Something like this, which uses is_utf8_charset() from numfmt:
ssize_t
mb_getndelim2 (char **lineptr, size_t *linesize, size_t offset,
               size_t nmax, mcel_t delim1, mcel_t delim2,
               FILE *stream)
{
  /* Fast path: bytes < 0x30 never occur within a multi-byte character
     in the encodings we support (GB18030 uses 0x30-0x39), and in UTF-8
     no ASCII byte does.  */
  if (MB_CUR_MAX == 1
      || (delim1.ch < 0x30 && delim2.ch < 0x30)
      || (is_utf8_charset () && delim1.ch < 0x80 && delim2.ch < 0x80))
    return getndelim2 (lineptr, linesize, offset, nmax, delim1.ch,
                       delim2.ch, stream);

  mbbuf_t mbbuf;
  char buffer[BUFSIZ];
  mbbuf_init (&mbbuf, buffer, sizeof buffer, stream);
  ssize_t bytes_read = 0;
  /* Read from the file using mbbuf_get_char until reaching a
     delimiter, allocating and copying into LINEPTR as needed
     and updating BYTES_READ.  */
  return bytes_read;
}
That would allow us to avoid many mbrtowc calls when using LC_ALL=C (*)
or when using a UTF-8 locale with ASCII delimiters.
WDYT?
Yes that optimization does look worthwhile.
Some general thoughts on handling delimiters...
getdelim() with '\n' or '\0' works with any multi-byte encoding (that we
support), as those two characters never occur within a multi-byte character.
For example numfmt uses this to get each line into a buffer,
with which it uses mbsmbchr() to efficiently find any delimiter.
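The pattern is roughly as follows (a sketch of the approach, not numfmt's
actual code; mbsmbchr() here takes the delimiter as its NUL-terminated
multi-byte encoding, as in the earlier sketch):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

extern char *mbsmbchr (const char *mbs, const char *mbc, bool utf8);

static void
process_lines (FILE *in, const char *delim, bool utf8)
{
  char *line = NULL;
  size_t linesize = 0;
  ssize_t len;

  /* getdelim() chunks the input safely in any supported encoding,
     since '\n' never occurs within a multi-byte character.  */
  while ((len = getdelim (&line, &linesize, '\n', in)) >= 0)
    {
      char *d = mbsmbchr (line, delim, utf8);
      /* ... process the line, splitting at D when non-NULL ...  */
      (void) d;
    }
  free (line);
}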
I.e. for line-oriented utils (like cut), perhaps it could be better to
base them around getdelim() to take advantage of that efficient/general
chunking of the input. The disadvantage would be the extra buffering
requirements for (arbitrarily long) lines, so it may not be appropriate for
all the cut_fields paths, but maybe you could use getdelim() as the fallback
in mb_getndelim2(), using mbsmbchr() to find the delimiter offset
that would be returned on the next call to mb_getndelim2().
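Something along these lines (hypothetical and untested; the file-scope
state is only for brevity, and newline/EOF details are elided):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

extern char *mbsmbchr (const char *mbs, const char *mbc, bool utf8);

/* One '\n'-chunked line, handed out one DELIM-separated field per call.  */
static char *line;
static size_t linesize;
static size_t pos;          /* offset of the unconsumed remainder */
static ssize_t linelen = -1;

static ssize_t
next_field (char **field, const char *delim, bool utf8, FILE *in)
{
  if (linelen < 0 || (size_t) linelen <= pos)
    {
      /* Refill: getdelim() with '\n' is safe in every supported
         multi-byte encoding.  */
      linelen = getdelim (&line, &linesize, '\n', in);
      if (linelen < 0)
        return -1;
      pos = 0;
    }

  *field = line + pos;
  char *d = mbsmbchr (line + pos, delim, utf8);
  if (d)
    {
      /* Remember where the next field starts, past the delimiter,
         so the search resumes there on the next call.  */
      size_t fieldlen = d - (line + pos);
      pos += fieldlen + strlen (delim);
      return fieldlen;
    }

  /* No more delimiters: the rest of the line is the last field.  */
  size_t rest = linelen - pos;
  pos = linelen;
  return rest;
}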
As an aside, for non-line-oriented utils another optimization we might do
is to have a utf8buf akin to mbbuf, but also taking advantage of the
self-synchronizing nature of UTF-8.
I.e. you can read large blocks into memory and just look at the last
few bytes to know where to set the end of the buffer aligned on a
character boundary, keeping the remaining few bytes to be moved
to the start of the buffer before the next read.
Perhaps there is already such functionality in gnulib/libunistring
but I haven't looked yet.
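For illustration, the boundary scan might look like this (a sketch only;
utf8_aligned_len is a hypothetical name, not an existing interface):

#include <stddef.h>

/* Given LEN bytes just read into BUF, return how many leading bytes
   end on a UTF-8 character boundary.  The caller carries the remaining
   LEN - result bytes (at most 3 for valid input) over to the start of
   the buffer before the next read.  */
static size_t
utf8_aligned_len (const unsigned char *buf, size_t len)
{
  size_t i = len;

  /* Step back over continuation bytes (0b10xxxxxx) to the last lead byte.  */
  while (i > 0 && (buf[i - 1] & 0xC0) == 0x80)
    i--;
  if (i == 0)
    return len;  /* no lead byte at all: invalid input, punt */

  /* Length the final sequence should have, per its lead byte.  */
  unsigned char lead = buf[i - 1];
  size_t seqlen = lead < 0x80 ? 1
                  : (lead & 0xE0) == 0xC0 ? 2
                  : (lead & 0xF0) == 0xE0 ? 3
                  : (lead & 0xF8) == 0xF0 ? 4
                  : 1;  /* invalid lead byte: treat as a single byte */

  /* If the final character is complete keep everything, else cut just
     before its lead byte.  */
  return seqlen <= len - (i - 1) ? len : i - 1;
}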
BTW we did some tuning/benchmarking of cut a long time ago,
which would be worth repeating/comparing any new code with.
The commits were:
ef9db5735a401f60eb5b4a18a365bf1ece525053
791919f6d9a873ae7452a7e1d71e2fe9e0fd4104
465f9512b710ee2fe03c3caf65bfdccdce3544ae
thanks!
Padraig