On 3/30/26 20:04, Collin Funk wrote:
Hi Pádraig,
Pádraig Brady <[email protected]> writes:
This patch set updates cut(1) to be multi-byte aware.
It is also an attempt to reduce interface divergence across implementations.
I've put the 60 patches here due to the quantity:
https://github.com/pixelb/coreutils/compare/cut-mb
Thanks for working on this!
# Interface / New functionality
macOS, i18n, uutils, Toybox, Busybox, GNU
-c x x x x x x
-n x x x
-w x x x
-F x x x
-O x x x
Yay compatibility! (The Android maintainer asked me to try to push it
for consistency between command implementations some years back...)
-c is needed anyway as specified by all, including POSIX.
-n is needed also as specified by i18n/macOS/POSIX
-w is somewhat less important, but seeing as it's
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compat.
"man cut" on Debian 12 doesn't have -w and -n says "ignored"? Let's see...
https://man.freebsd.org/cgi/man.cgi?cut
Whitespace. So cut -F without specifying -d. Eh, easy enough to add...
-F and -O are really just aliases to other options
so trivial to add, and probably worthwhile for compatibility.
If I'd found other options that did this nine years ago, I wouldn't have
bothered...
I guess people like -w since it has been requested at least a few times,
IIRC. I never really cared for it since 'awk' is easy enough to use to
split at multiple blanks.
It pulls in a dependency on an entire programming language. It's not as
heavyweight as perl or python, but it's up there. ("The AWK Programming
Language" from 1988 is 228 pages: K&R C second edition is only 236.)
You get "cut" in coreutils as part of the standard set, but awk is its
own package with multiple _standalone_ implementations. GNU has gawk,
Debian's using mawk, Android's using Brian Kernighan's one-true-awk from
1977 (still maintained apparently, although Kernighan seems to have
handed it off to Oz Yigit in 2023)...
I got an awk implementation contributed to toybox (which can't use
busybox's because licensing) which is twice the size of sed+tar+grep
_combined_ (or at least twice the line count).
I don't think -F and -O are that useful, but there is only so much 'cut'
can do. I don't think someone will come up with divergent behavior for
them. So I guess it is okay.
Interface / functionality notes:
There is a slight divergence between -n implementations.
There was already a difference between FreeBSD and i18n, and
we've aligned with the more sensible FreeBSD implementation.
Oh goddess, what did _they_ do about combining characters...
Note the i18n -n implementation is otherwise buggy in any case,
so I doubt this will be a practical compatibility concern.
Actually -n is specified by POSIX, and it matches FreeBSD.
Specifically our -n will not output a character unless the
byte range encompasses _the end_ of the multi-byte character.
I.e. the -b is a limit that is not passed, and thus ensures
we don't output overlapping characters for separate cut
invocations that do not have overlapping byte ranges.
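
For illustration, the end-of-character rule described above can be sketched in Python. This is only one reading of the description, not code from the patch set: the name cut_bytes_n and its parameters are made up for the example, it handles a single FIRST-LAST byte range, and UTF-8 is assumed.

```python
def cut_bytes_n(line: str, first: int, last: int, encoding: str = "utf-8") -> str:
    """Hypothetical model of 'cut -b FIRST-LAST -n' (not the real implementation).

    Byte positions are 1-based and inclusive, as in cut(1).
    A character is kept only if its final byte lies within [first, last].
    """
    out = []
    pos = 0  # bytes consumed so far
    for ch in line:
        pos += len(ch.encode(encoding))  # pos is now the index of ch's last byte
        if first <= pos <= last:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    text = "aéb"  # 'é' occupies bytes 2-3 in UTF-8
    print(cut_bytes_n(text, 1, 2))
    print(cut_bytes_n(text, 3, 4))
```

With "aéb", the ranges 1-2 and 3-4 give "a" and "éb": the two-byte character shows up only in the range that contains its final byte, so non-overlapping byte ranges never emit the same character twice.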
Huh, I read the man page differently:
-n Do not split multi-byte characters. Characters will only be
output if at least one byte is selected, and, after a prefix of
zero or more unselected bytes, the rest of the bytes that form
the character are selected.
I thought "the rest of the bytes that form the character are selected"
meant the selection was expanded to include the end of a partially
selected character. (But that was a quick glance, not testing the
implementation. I need to set up ssh in my FreeBSD vm so I'm not
manually typing every test through the graphical window but can actually
script and paste stuff...)
What do they mean by "prefix" there, anyway? I thought combining
characters in unicode went _after_ the printable character (so you can
never be sure you're done until you overshoot or hit EOF, because
Microsoft was on the committee).
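
To make the lookahead problem concrete, here is a rough Python sketch that groups each base character with the combining marks following it. It is a simplification, not full Unicode grapheme segmentation: it only checks unicodedata.combining(), and the helper name clusters is made up for the example.

```python
import unicodedata


def clusters(text):
    """Yield each base character together with the combining marks after it.

    Simplified: a character starts a new group when its canonical combining
    class is 0. Real grapheme cluster rules are more involved.
    """
    group = ""
    for ch in text:
        # A group can only be emitted once the NEXT base character is seen.
        if group and unicodedata.combining(ch) == 0:
            yield group
            group = ""
        group += ch
    if group:
        yield group  # the last group is only known to be complete at EOF


if __name__ == "__main__":
    for g in clusters("e\u0301a"):
        print([hex(ord(c)) for c in g])
```

Note that a group is only emitted when the following base character arrives or the input ends, which is the "overshoot or hit EOF" behavior mentioned above.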
I hadn't directly opened the multibyte can of worms yet because "does
the range specify bytes or characters" and "does that mean visible
characters or combining characters" seemed like a design headache
requiring multiple new options I wasn't interested in unilaterally
declaring. That said, I'd vaguely assumed the regex engine could be
aware of that stuff and it handling iswspace() for me in the
"[[:space:]]" stuff was part of the appeal of doing it that way. The
old -f was bytes, regex could be unicode aware via libc, and it wasn't
MY immediate problem. :)
-d <regex> from toybox is not implemented.
That's edge case functionality IMHO and not well suited to cut(1).
This functionality is supported by awk, and regex functionality
is best restricted to awk I think.
Agreed.
Ok, I'll bite. What do you think -F does?
$ toybox --help cut | toybox cut -d $'\n' -f 15-18
-d Input delimiter (default is TAB for -f, run of whitespace for -F)
-D Don't sort/collate selections or match -fF lines without delimiter
-f Select fields (words) separated by single DELIM character
-F Select fields separated by DELIM regex
"cut -F" works like "cut -f" except it treats -d's argument as a regex
and changes its default value to "[[:space:]][[:space:]]*". (You have to
also specify -D to make "echo one two three | cut -D -d ' ' -f 2,2,1"
actually do what was asked of it, but that's a separate issue.)
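
The -F behavior described above can be modeled roughly in Python. This is a sketch under assumptions, not toybox's code: cut_fields and its parameters are hypothetical names, Python's \s+ stands in for "[[:space:]][[:space:]]*", joining output with a single space is an assumption, and "sort/collate" is assumed to mean sorted and de-duplicated field numbers.

```python
import re


def cut_fields(line, fields, delim=r"\s+", out_delim=" ", keep_order=False):
    """Hypothetical model of 'cut -F': split on a regex delimiter, pick fields.

    fields     -- 1-based field numbers, e.g. [2, 2, 1]
    keep_order -- models -D: emit fields exactly as requested instead of
                  sorting and de-duplicating them (assumed meaning of collate)
    """
    parts = re.split(delim, line.rstrip("\n"))
    if not keep_order:
        fields = sorted(set(fields))
    # Silently skip field numbers beyond the end of the line.
    return out_delim.join(parts[f - 1] for f in fields if 1 <= f <= len(parts))


if __name__ == "__main__":
    print(cut_fields("one  two\tthree", [2, 2, 1]))
    print(cut_fields("one  two\tthree", [2, 2, 1], keep_order=True))
```

Under these assumptions the first call prints "one two" and the second prints "two two one", which is the difference -D makes in the example above.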
Rob
P.S. is cut -d $'\n' actually documented in the man page?