On Wed, Feb 04, 2026 at 04:47:03PM +0100, Vincent Lefevre wrote:
> In the upstream bug, Eric Blake said: "Several distros have add-on
> patches that add wide char support, but to date, no one has yet
> submitted a patch upstream that is both easy to maintain (doesn't
> needlessly duplicate big blocks of code over char vs. wchar_t) and
> which doesn't penalize speed on single-byte locales."
FTR, in voreutils cut (0BSD:
 <http://ro.ws.co.ls/cut.1>,
 <https://git.sr.ht/~nabijaczleweli/voreutils/tree/trunk/item/cmd/cut.cpp>),
this is implemented with the -d argument being a byte span ("field_sep"),
so delimiter search reduces to memmem()/memchr() ("l.find(*field_sep)"),
which means that -d: -dя -d$'\377' -dupa are all handled identically;
this seemed like an obvious generalisation to me,
so cut(1), STANDARDS, just notes that
> Allowing -d longer than one character is an extension, compatible
> with the illumos gate ‒ some nonconformant implementations only allow
> a single byte (the GNU system) or only use the first byte of the
> delim (NetBSD, OpenBSD). Using NUL for an empty delim is likewise
> an extension, compatible with the illumos gate, the GNU system,
> NetBSD, and OpenBSD.
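
To illustrate the byte-span model, here is a minimal sketch; it is not the
voreutils code. The helper name nth_field and its behaviour for short lines
are assumptions made for this example only: the delimiter is treated as an
opaque run of bytes and found with a plain substring search, so no decoding
is ever needed.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from voreutils): return the n-th (1-based) field
// of `line`, splitting on the byte span `sep`. Nothing is decoded, so ":" and
// a multi-byte sequence such as UTF-8 "я" go through the same code path.
// Assumes `sep` is non-empty and n >= 1.
std::string nth_field(std::string_view line, std::string_view sep, std::size_t n) {
    // cut prints lines that contain no delimiter unchanged (unless -s is given).
    if (line.find(sep) == std::string_view::npos)
        return std::string{line};

    // Skip the first n-1 fields, each terminated by one occurrence of `sep`.
    for (std::size_t i = 1; i < n; ++i) {
        const auto at = line.find(sep);
        if (at == std::string_view::npos)
            return {};  // the line has fewer than n fields
        line.remove_prefix(at + sep.size());
    }

    // The wanted field runs up to the next delimiter, or to the end of the line.
    return std::string{line.substr(0, line.find(sep))};
}
```

Under these assumptions, nth_field("QWEaQWEabQWE", "ab", 2) yields "QWE",
the same as the cut -d'ab' -f2 example in this message.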

$ echo QWEaQWEabQWE | cut -d'ab' -f2
QWE
$ echo QWEaQWEabQWE | /bin/cut -d'ab' -f2
/bin/cut: the delimiter must be a single character
Try '/bin/cut --help' for more information.

I believe you get the same result as the first command on the illumos gate
(I tested this on tribblix, if memory serves).
Parsing the input as characters only happens in -nb and -c modes,
and then only via mbrlen(), which is the minimum required.
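
As a rough sketch of what "only mbrlen()" can look like (again not the
voreutils code): the helper name prefix_bytes and the policy of counting an
invalid or truncated sequence as one byte are assumptions made for this
example. It finds how many bytes the first N characters occupy in the current
LC_CTYPE locale, without ever converting to wchar_t.

```cpp
#include <cstddef>
#include <cwchar>
#include <string_view>

// Hypothetical helper (not from voreutils): number of bytes occupied by the
// first `nchars` characters of `s` in the current LC_CTYPE locale.
// Assumed policy: an invalid or truncated sequence counts as a single byte.
std::size_t prefix_bytes(std::string_view s, std::size_t nchars) {
    std::mbstate_t st{};
    std::size_t off = 0;
    for (; nchars && off < s.size(); --nchars) {
        std::size_t len = std::mbrlen(s.data() + off, s.size() - off, &st);
        if (len == static_cast<std::size_t>(-1) || len == static_cast<std::size_t>(-2)) {
            st = std::mbstate_t{};  // conversion state is unusable after an error; reset it
            len = 1;
        } else if (len == 0) {
            len = 1;  // an embedded NUL is a one-byte character
        }
        off += len;
    }
    return off;
}
```

A caller would normally run setlocale(LC_CTYPE, "") first so that mbrlen()
follows the user's encoding; without it the program stays in the "C" locale.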

So duplication is not necessary. Of course, one can conceive of an
encoding where you could encode я into bytes two different ways,
and you'd want cut -dя to match both. Whether that is real,
whether you consider that to be real, and whether that would be
a useful behaviour vs byte span matching will inform whether
that implementation model is viable for coreutils.

Best,
