On 31/03/2026 02:04, Collin Funk wrote:
Hi Pádraig,

Pádraig Brady <[email protected]> writes:

This patch set updates cut(1) to be multi-byte aware.
It is also an attempt to reduce interface divergence across implementations.

I've put the 60 patches here due to the quantity:
https://github.com/pixelb/coreutils/compare/cut-mb

Thanks for working on this!

# Interface / New functionality

        macOS  i18n  uutils  Toybox  Busybox  GNU
  -c      x     x      x       x        x     x
  -n      x     x                             x
  -w      x            x                      x
  -F                           x        x     x
  -O                           x        x     x

-c is needed anyway, as specified by all implementations, including POSIX.
-n is also needed, as specified by i18n/macOS/POSIX.
-w is somewhat less important, but since it's available
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compat.
-F and -O are really just aliases for other options,
so they are trivial to add and probably worthwhile for compatibility.

I guess people like -w since it has been requested at least a few times,
IIRC. I never really cared for it since 'awk' is easy enough to use to
split at multiple blanks.

Yes, it was always one of those 50:50 ones,
but the balance has shifted given the existing implementations.

We prefer memchr() and strstr() as these are tuned for specific platforms
on glibc, even if memchr2() or memmem() are algorithmically better.

Makes sense, but I hope this can be removed in the future:

    #if ! __GLIBC__  /* Only S390 has optimized memmem on glibc-2.42  */
      return memmem (buf, len, delim_bytes, delim_length);
    #else

Yes, Chris Down mentioned he may look at implementing SIMD memmem().
Something to be revisited in future anyway.
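
For illustration, the trade-off above can be sketched as a small dispatcher (hypothetical helper name; not the actual coreutils code) that uses the per-platform tuned memchr() for the common single-byte delimiter and only falls back to the more general memmem() for multi-byte delimiters:

```c
#define _GNU_SOURCE  /* memmem() needs this on glibc */
#include <string.h>

/* Hypothetical helper: return a pointer to the first occurrence of the
   delimiter in BUF, or NULL.  memchr() is heavily tuned per platform in
   glibc, so prefer it for the single-byte case; memmem() handles the
   general multi-byte case.  */
static void *
find_delim (char const *buf, size_t len,
            char const *delim, size_t delim_len)
{
  if (delim_len == 1)
    return memchr (buf, delim[0], len);
  return memmem (buf, len, delim, delim_len);
}
```

The split keeps the hot single-byte path on the routine glibc optimizes most aggressively, while still being correct for arbitrary delimiters.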

I only had a quick skim over the patch, but it generally looks good.

It reminded me of one thing though. You used a buffer like this:

    static char line_in[IO_BUFSIZE];

I think this was a mistake in my multibyte 'fold' implementation. I'm
not really sure why I chose to use IO_BUFSIZE. It is meant to minimize
system call overhead, but since we are using fread/fwrite, libc chooses
how much to read and write per system call. For example on glibc:

     $ strace -e trace='/read|write' ./src/cut -f 1 /dev/zero 2>&1 | head
     [...]
     read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144
     write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144
     read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144
     write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
     write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 258048) = 258048
     read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144
     write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
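
As a minimal sketch (hypothetical, not the coreutils source), the kind of copy loop being traced above looks like this; the point is that the fread()/fwrite() request size acts as a strong hint for the size of the underlying read(2)/write(2) calls that libc chooses to issue:

```c
#include <stdio.h>

enum { IO_BUFSIZE = 262144 };  /* assumed value, matching the 262144-byte reads above */

/* Hypothetical copy loop: large fread()/fwrite() requests let libc
   issue correspondingly large read(2)/write(2) system calls.
   Returns 0 on success, -1 on a read or write error.  */
static int
copy_stream (FILE *in, FILE *out)
{
  static char buf[IO_BUFSIZE];
  size_t n;
  while ((n = fread (buf, 1, sizeof buf, in)) > 0)
    if (fwrite (buf, 1, n, out) != n)
      return -1;
  return ferror (in) ? -1 : 0;
}
```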

I think it probably makes sense to just use BUFSIZ there. Likewise for
'fold' and 'expand':

     $ sed -i 's/IO_BUFSIZE/BUFSIZ/g' src/cut.c && make
     $ timeout 10 taskset 1 ./src/cut-iobufsize -f 1 /dev/zero | taskset 2 pv -r > /dev/null
     [1.80GiB/s]
     $ timeout 10 taskset 1 ./src/cut -f 1 /dev/zero | taskset 2 pv -r > /dev/null
     [2.18GiB/s]


You're right that stdio picks the I/O sizes, but the fread/fwrite request size
is a strong hint for the I/O size used, and using IO_BUFSIZE is always a win for me.
We can also force the stdio I/O size with stdbuf (or setvbuf),
so testing various combinations here:

$ timeout 5 taskset 1 ./src/cut-bufsiz -f 1 /dev/zero | taskset 1 pv -r >/dev/null
[ 649MiB/s]
$ timeout 5 taskset 1 ./src/cut-bufsiz -f 1 /dev/zero | taskset 2 pv -r >/dev/null
[1.29GiB/s]
$ timeout 5 taskset 1 ./src/cut-iobufsize -f 1 /dev/zero | taskset 1 pv -r >/dev/null
[2.51GiB/s]
$ timeout 5 taskset 1 ./src/cut-iobufsize -f 1 /dev/zero | taskset 2 pv -r >/dev/null
[1.98GiB/s]
$ timeout 5 taskset 1 stdbuf -o 262144 ./src/cut-iobufsize -f 1 /dev/zero | taskset 1 pv -r >/dev/null
[2.38GiB/s]
$ timeout 5 taskset 1 stdbuf -o 262144 ./src/cut-iobufsize -f 1 /dev/zero | taskset 2 pv -r >/dev/null
[1.89GiB/s]

So in summary the current IO_BUFSIZE performs best for me.
I was considering setting the stdio buffer size to IO_BUFSIZE with setvbuf(),
but there seems to be no need given the above results.
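
For reference, forcing the stdio buffer size from inside the program would look roughly like this (a hedged sketch; the helper name and the IO_BUFSIZE value are assumptions, and setvbuf() must run before the first operation on the stream):

```c
#include <stdio.h>
#include <stdlib.h>

enum { IO_BUFSIZE = 262144 };  /* assumed value */

/* Hypothetical: give STREAM a fully buffered IO_BUFSIZE-byte stdio
   buffer.  Must be called before the first read or write on STREAM.
   Returns the buffer (which must outlive the stream) or NULL on
   failure.  */
static char *
set_large_buffer (FILE *stream)
{
  char *buf = malloc (IO_BUFSIZE);
  if (buf && setvbuf (stream, buf, _IOFBF, IO_BUFSIZE) != 0)
    {
      free (buf);
      buf = NULL;
    }
  return buf;
}
```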

I also did a full perf run with all previously shown option combinations,
and all performed a bit better with the larger I/O size.

Regarding the variable names "line_in" and "line_out": those don't
really make much sense. They were accurate when I changed 'fold' to
use getline(), but I should have renamed them after moving to mbbuf. I
didn't realize until after pushing that change and didn't bother fixing
it, though it annoyed me a little bit. :)
Good point. Renamed to bytes_in.

cheers,
Padraig
