On 07/04/2026 20:25, Rob Landley wrote:
On 3/30/26 20:04, Collin Funk wrote:
Hi Pádraig,

Pádraig Brady <[email protected]> writes:

This patch set updates cut(1) to be multi-byte aware.
It is also an attempt to reduce interface divergence across implementations.

I've put the 60 patches here due to the quantity:
https://github.com/pixelb/coreutils/compare/cut-mb

Thanks for working on this!

# Interface / New functionality

      macOS  i18n  uutils  Toybox  Busybox  GNU
-c    x      x     x       x       x        x
-n    x      x                              x
-w    x            x                        x
-F                         x       x        x
-O                         x       x        x

Yay compatibility! (The Android maintainer asked me to try to push it
for consistency between command implementations some years back...)

-c is needed anyway as specified by all, including POSIX.
-n is needed also as specified by i18n/macOS/POSIX
-w is somewhat less important, but seeing as it's
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compat.

"man cut" on Debian 12 doesn't have -w, and -n says "ignored"? Let's see...

https://man.freebsd.org/cgi/man.cgi?cut

Whitespace. So cut -F without specifying -d. Eh, easy enough to add...
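For illustration only, here is a minimal Python sketch of the difference between splitting on a single delimiter and treating a run of blanks as one separator, which is the behavior -w is described as providing. The helper name select_fields and its parameters are hypothetical, and the sketch deliberately uses a line with no leading blanks, since how each implementation treats those is not covered here.

```python
import re


def select_fields(line, wanted, blanks=False, delim="\t"):
    """Return the 1-based fields listed in `wanted`.

    blanks=True treats any run of spaces/tabs as one separator
    (an assumption modelling -w); otherwise split on `delim` only.
    """
    parts = re.split(r"[ \t]+", line) if blanks else line.split(delim)
    # Silently skip field numbers that the line does not have.
    return [parts[i - 1] for i in wanted if 0 < i <= len(parts)]


line = "alpha  beta\t gamma"
print(select_fields(line, [1, 3], blanks=True))  # ['alpha', 'gamma']
print(select_fields(line, [1, 3]))               # ['alpha  beta']
```

With the default tab delimiter the line only has two fields, so field 3 is simply absent; with blank-run splitting the same line has three.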


-F and -O are really just aliases to other options
so trivial to add, and probably worthwhile for compatibility.

If I'd found other options that did this nine years ago, I wouldn't have
bothered...

FWIW it seems FreeBSD added -w in release 9.2, Sep 2013.

I guess people like -w since it has been requested at least a few times,
IIRC. I never really cared for it since 'awk' is easy enough to use to
split at multiple blanks.

It pulls in a dependency on an entire programming language. It's not as
heavyweight as perl or python, but it's up there. ("The AWK Programming
Language" from 1988 is 228 pages: K&R C second edition is only 236.)

You get "cut" in coreutils as part of the standard set, but awk is its
own package with multiple _standalone_ implementations. GNU has gawk,
Debian's using mawk, Android's using Brian Kernighan's one-true-awk from
1977 (still maintained apparently, although Kernighan seems to have
handed it off to Oz Yigit in 2023)...

I got an awk implementation contributed to toybox (which can't use
busybox's because licensing) which is twice the size of sed+tar+grep
_combined_ (or at least twice the line count).

cut(1) tries to be an efficient streaming filter,
and regexes don't really fit that mold.
Given there are existing solutions for this somewhat edge-case functionality,
it doesn't seem appropriate to add, IMHO.

I don't think -F and -O are that useful, but there is only so much 'cut'
can do. I don't think someone will come up with divergent behavior for
them. So I guess it is okay.

Interface / functionality notes:

There is a slight divergence between -n implementations.
There was already a difference between FreeBSD and i18n, and
we've aligned with the more sensible FreeBSD implementation.

Oh goddess, what did _they_ do about combining characters...

GNU treats combining characters as separate as per:
https://github.com/coreutils/coreutils/commit/fe0082333
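To make "combining characters as separate" concrete, here is a small Python sketch. It uses Python's code-point indexing as a stand-in for character selection; it is not GNU's implementation, and the helper name first_chars is hypothetical.

```python
import unicodedata


def first_chars(text, n):
    """Select the first n code points, with no grapheme clustering."""
    return text[:n]


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, then "tude".
decomposed = "e\u0301tude"

# Selecting one "character" keeps the base letter and drops the accent,
# because the combining mark counts as its own character.
print(first_chars(decomposed, 1) == "e")           # True
print(unicodedata.combining(decomposed[1]) != 0)   # True: it is a combining mark

# The precomposed (NFC) form stores the accented letter as one code point.
composed = unicodedata.normalize("NFC", decomposed)
print(first_chars(composed, 1) == "\u00e9")        # True
print(len(decomposed), len(composed))              # 6 5
```

So the same visible text can give different results for a character range depending on whether the input is precomposed or decomposed.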

P.S. is cut -d $'\n' actually documented in the man page?

It will soon be documented in the info manual
(which is linked from the man page):
https://github.com/coreutils/coreutils/commit/c3e819fad

cheers,
Padraig
