Hi Pádraig,

Pádraig Brady <[email protected]> writes:

> This patch set updates cut(1) to be multi-byte aware.
> It is also an attempt to reduce interface divergence across implementations.
>
> I've put the 60 patches here due to the quantity:
> https://github.com/pixelb/coreutils/compare/cut-mb

Thanks for working on this!

> # Interface / New functionality
>
>     macOS,  i18n, uutils, Toybox, Busybox, GNU
> -c    x      x       x      x        x      x
> -n    x      x                              x
> -w    x              x                      x
> -F                          x        x      x
> -O                          x        x      x
>
> -c is needed anyway as specified by all, including POSIX.
> -n is needed also as specified by i18n/macOS/POSIX
> -w is somewhat less important, but seeing as it's
> on two other common platforms (and its functionality is
> provided on two more), providing it is worthwhile for compat.
> -F and -O are really just aliases to other options
> so trivial to add, and probably worthwhile for compatibility.

I guess people like -w since it has been requested at least a few times,
IIRC. I never really cared for it since 'awk' is easy enough to use to
split at multiple blanks.

I don't think -F and -O are that useful, but there is only so much 'cut'
can do. I don't think someone will come up with divergent behavior for
them. So I guess it is okay.

> Interface / functionality notes:
>
> There is a slight divergence between -n implementations.
> There was already a difference between FreeBSD and i18n, and
> we've aligned with the more sensible FreeBSD implementation.
> Note the i18n -n implementation is otherwise buggy in any case,
> so I doubt this will be a practical compatibility concern.
> Actually -n is specified by POSIX, and it matches FreeBSD.
> Specifically our -n will not output a character unless the
> byte range encompasses _the end_ of the multi-byte character.
> I.e. the -b is a limit that is not passed, and thus ensures
> we don't output overlapping characters for separate cut
> invocations that do not have overlapping byte ranges.

Agreed, I think it is better to behave as POSIX requires even if it
diverges from the i18n patch.

> -d <regex> from toybox is not implemented.
> That's edge case functionality IMHO and not well suited to cut(1).
> This functionality is supported by awk, and regex functionality
> is best restricted to awk I think.

Agreed.

> cut is a significant part of the i18n patch, so it will be good
> to avoid that downstream divergence.  Unfortunately there were
> no tests with the cut i18n implementation.

Fun. :)

> # Performance
>
> General performance notes:
>
> We prefer byte searching (with -d) as that can be much faster
> than character by character processing, and it's supported
> on single byte and UTF-8 charsets.
> We also use byte searching with -w on uni-byte locales.
> This was seen to give up to 100x performance increase over the i18n patch.
>
> Where we do use per character processing, we avoid conversion
> to wide char when processing ASCII data (mcel provides this optimization).
> This was seen to give a 14x performance increase over the i18n patch.
>
> We prefer memchr() and strstr() as these are tuned for specific platforms
> on glibc, even if memchr2() or memmem() are algorithmically better.

Makes sense, but I hope this can be removed in the future:

   #if ! __GLIBC__  /* Only S390 has optimized memmem on glibc-2.42  */
     return memmem (buf, len, delim_bytes, delim_length);
   #else

> We maintain the important memory behavior of only buffering when necessary.

Nice.

> In summary, compared to the i18n patch we're now as fast in all cases,
> and much faster in most cases.
>
> We can see the -f byte searching performing well,
> being 120x faster in the no matching delimiter case,
> to at least 3x faster in the matching delimiter case.
>
> When we resort to per character processing we also compare well,
> being 14x faster in the ASCII processing case
> (due to mcel short-circuiting the wide char conversion).
> Note the processing mb.in results above also show a 2x win
> in per character processing cases, but the i18n patch would
> have also picked that win up as it's achieved separately
> to this patch set:
> https://lists.gnu.org/r/coreutils/2026-03/msg00117.html

Nice work!

I only had a quick skim over the patch, but it generally looks good.

It reminded me of one thing though. You used a buffer like this:

   static char line_in[IO_BUFSIZE];

I think this was a mistake in my multibyte 'fold' implementation. I'm
not really sure why I chose to use IO_BUFSIZE. It is meant to minimize
system call overhead, but since we are using fread/fwrite, libc chooses
how much to read and write per system call. For example on glibc:

    $ strace -e trace='/read|write' ./src/cut -f 1 /dev/zero  2>&1 | head
    [...]
    read(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) 
= 262144
    write(1, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) 
= 262144
    read(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) 
= 262144
    write(1, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 
4096
    write(1, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 258048) 
= 258048
    read(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) 
= 262144
    write(1, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 
4096

I think it probably makes sense to just use BUFSIZ there. Likewise for
'fold' and 'expand':

    $ sed -i 's/IO_BUFSIZE/BUFSIZ/g' src/cut.c && make
    $ timeout 10 taskset 1 ./src/cut-iobufsize -f 1 /dev/zero  | taskset 2 pv 
-r > /dev/null
    [1.80GiB/s]
    $ timeout 10 taskset 1 ./src/cut -f 1 /dev/zero  | taskset 2 pv -r > 
/dev/null
    [2.18GiB/s]

Regarding the variable names "line_in" and "line_out". Those don't
really make too much sense. They were accurate when I changed 'fold' to
use getline(). But I should have changed that after moving to mbbuf. I
didn't realize until after pushing that change and didn't bother fixing
it, but it annoyed me a little bit. :)

Collin

Reply via email to