Hi Pádraig, Pádraig Brady <[email protected]> writes:
> This patch set updates cut(1) to be multi-byte aware. > It is also an attempt to reduce interface divergence across implementations. > > I've put the 60 patches here due to the quantity: > https://github.com/pixelb/coreutils/compare/cut-mb Thanks for working on this! > # Interface / New functionality > > macOS, i18n, uutils, Toybox, Busybox, GNU > -c x x x x x x > -n x x x > -w x x x > -F x x x > -O x x x > > -c is needed anyway as specified by all, including POSIX. > -n is needed also as specified by i18n/macOS/POSIX > -w is somewhat less important, but seeing as it's > on two other common platforms (and its functionality is > provided on two more), providing it is worthwhile for compat. > -F and -O are really just aliases to other options > so trivial to add, and probably worthwhile for compatibility. I guess people like -w since it has been requested at least a few times, IIRC. I never really cared for it since 'awk' is easy enough to use to split at multiple blanks. I don't think -F and -O are that useful, but there is only so much 'cut' can do. I don't think someone will come up with divergent behavior for them. So I guess it is okay. > Interface / functionality notes: > > There is a slight divergence between -n implementations. > There was already a difference between FreeBSD and i18n, and > we've aligned with the more sensible FreeBSD implementation. > Note the i18n -n implementation is otherwise buggy in any case, > so I doubt this will be a practical compatibility concern. > Actually -n is specified by POSIX, and it matches FreeBSD. > Specifically our -n will not output a character unless the > byte range encompasses _the end_ of the multi-byte character. > I.e. the -b is a limit that is not passed, and thus ensures > we don't output overlapping characters for separate cut > invocations that do not have overlapping byte ranges. Agreed, I think it is better to behave as POSIX requires even if it diverges from the i18n patch. > -d <regex> from toybox is not implemented. > That's edge case functionality IMHO and not well suited to cut(1). > This functionality is supported by awk, and regex functionality > is best restricted to awk I think. Agreed. > cut is a significant part of the i18n patch, so it will be good > to avoid that downstream divergence. Unfortunately there were > no tests with the cut i18n implementation. Fun. :) > # Performance > > General performance notes: > > We prefer byte searching (with -d) as that can be much faster > than character by character processing, and it's supported > on single byte and UTF-8 charsets. > We also use byte searching with -w on uni-byte locales. > This was seen to give up to 100x performance increase over the i18n patch. > > Where we do use per character processing, we avoid conversion > to wide char when processing ASCII data (mcel provides this optimization). > This was seen to give a 14x performance increase over the i18n patch. > > We prefer memchr() and strstr() as these are tuned for specific platforms > on glibc, even if memchr2() or memmem() are algorithmically better. Makes sense, but I hope this can be removed in the future: #if ! __GLIBC__ /* Only S390 has optimized memmem on glibc-2.42 */ return memmem (buf, len, delim_bytes, delim_length); #else > We maintain the important memory behavior of only buffering when necessary. Nice. > In summary, compared to the i18n patch we're now as fast in all cases, > and much faster in most cases. > > We can see the -f byte searching performing well, > being 120x faster in the no matching delimiter case, > to at least 3x faster in the matching delimiter case. > > When we resort to per character processing we also compare well, > being 14x faster in the ASCII processing case > (due to mcel short-circuiting the wide char conversion). > Note the processing mb.in results above also show a 2x win > in per character processing cases, but the i18n patch would > have also picked that win up as it's achieved separately > to this patch set: > https://lists.gnu.org/r/coreutils/2026-03/msg00117.html Nice work! I only had a quick skim over the patch, but it generally looks good. It reminded me of one thing though. You used a buffer like this: static char line_in[IO_BUFSIZE]; I think this was a mistake in my multibyte 'fold' implementation. I'm not really sure why I chose to use IO_BUFSIZE. It is meant to minimize system call overhead, but since we are using fread/fwrite, libc chooses how much to read and write per system call. For example on glibc: $ strace -e trace='/read|write' ./src/cut -f 1 /dev/zero 2>&1 | head [...] read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 258048) = 258048 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 I think it probably makes sense to just use BUFSIZ there. Likewise for 'fold' and 'expand': $ sed -i 's/IO_BUFSIZE/BUFSIZ/g' src/cut.c && make $ timeout 10 taskset 1 ./src/cut-iobufsize -f 1 /dev/zero | taskset 2 pv -r > /dev/null [1.80GiB/s] $ timeout 10 taskset 1 ./src/cut -f 1 /dev/zero | taskset 2 pv -r > /dev/null [2.18GiB/s] Regarding the variable names "line_in" and "line_out". Those don't really make too much sense. They were accurate when I changed 'fold' to use getline(). But I should have changed that after moving to mbbuf. I didn't realize until after pushing that change and didn't bother fixing it, but it annoyed me a little bit. :) Collin
