On Friday, February 23rd, 2024 at 21:14, Mouse <mo...@rodents-montreal.org> wrote:

> > unexpand "converts spaces to tabs".
>
> > This command's behavior is so simple (s/ /\t/g) that it can be
> > knocked out in a couple hours,
>
> Well...sort of. unexpand without -a can be, sure. With -a, it's more
> complicated, unless you are willing to assume things like "no multibyte
> characters" or "all non-ASCII text is Shift-JIS".
>
> > Since the command only looks for 2 characters (' ' and '\t'), no UTF
> > safety checking is required,
>
> Safety? If you want to support multibyte characters of any sort with
> -a, you need to parse them enough to determine how many bytes make up
> each character, because that affects how many spaces to eat to convert
> to a tab. (Without -a, this is not an issue.)
>
> For example, if you get a line containing, in hex,
>
>     d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40
>
> then (assuming 8-character tabstops and -a in effect), under
> 8859-1 you have (to use Unicode names) LATIN CAPITAL LETTER ETH and
> DEGREE SIGN, with the pair repeated three times, and you thus convert
> the first two of the spaces to a tab, but under UTF-8 you have three
> instances of CYRILLIC SMALL LETTER A and you thus convert the first
> five of the spaces to a tab. (Handling tabs in the input makes it
> even more complicated.)
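
The column arithmetic in the example above can be checked with a short
sketch. This is an illustration in Python, not toybox code and not any
real unexpand implementation. The name unexpand_all is made up for the
example, and it assumes that every decoded character occupies exactly
one column (no wide or combining characters) and that the input line
contains no tabs or backspaces.

```python
def unexpand_all(raw: bytes, encoding: str, tabstop: int = 8) -> bytes:
    """Rough model of `unexpand -a` for one line (hypothetical helper).

    Assumptions: one column per decoded character, and no tabs or
    backspaces in the input.
    """
    text = raw.decode(encoding)
    out = []
    col = 0        # current column, counted in characters, not bytes
    pending = 0    # spaces seen since the last tab stop or non-space

    for ch in text:
        if ch == " ":
            pending += 1
            col += 1
            if col % tabstop == 0:
                # A run of two or more spaces ending exactly at a tab
                # stop becomes one tab; a lone space is left alone.
                out.append("\t" if pending >= 2 else " ")
                pending = 0
        else:
            # The run ended before a tab stop: keep the spaces as-is.
            out.append(" " * pending)
            pending = 0
            out.append(ch)
            col += 1

    out.append(" " * pending)  # trailing spaces, if any
    return "".join(out).encode(encoding)


# The line from the example above, as raw bytes.
RAW = bytes.fromhex("d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40")

if __name__ == "__main__":
    for enc in ("iso-8859-1", "utf-8"):
        print(enc, unexpand_all(RAW, enc).hex(" "))
```

Under these assumptions, decoding the same bytes as 8859-1 gives six
columns before the spaces, so the tab replaces two spaces, while
decoding them as UTF-8 gives three columns, so the tab replaces five.
The byte count alone (six in both cases) does not determine the result.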

From the NetBSD man page you quote later: "If the -a option is given,
then tabs are inserted whenever they would compress the resultant file
by replacing two or more characters."

Correct me if I'm wrong, but I don't see how UTF-8 has anything to do
with this. It takes a string of spaces and replaces it with
length/tabwidth tabs, then length%tabwidth spaces. POSIX says this too:
"translate all sequences of two or more <blank> characters immediately
preceding a tab stop to the maximum number of <tab> characters followed
by the minimum number of <space> characters needed to fill the same
column positions originally filled by the translated <blank>
characters."

Sigh. Skimming over lib/utf8.c, and assuming utf8len() is like strlen()
but for UTF-8, that might make things a bit easier? I was hoping to
never have to touch UTF-8 while writing this.

> > The GNU man page doesn't say if spaces are supposed to be processed
> > beyond the beginning of lines.
>
> The GNU man page is relevant to only the GNU version.

It's not relevant to _any_ version, because it does not document the
behavior of any implementation, not even its own. It fails to document
user-visible things such as the actual behavior of -a. Saying "convert
all blanks, instead of just initial blanks" and NOTHING else about -a
is misleading.

> I would not use
> it as a reference for anything else, least of all what the command
> should do in the abstract. (That said, I would have hoped they would
> document their software more precisely, such as saying what happens to
> non-initial whitespace in the absence of -a.)
>
> A non-GNU (NetBSD) manpage I have handy says
>
>     -a  By default, only leading blanks and tabs are reconverted to maximal
>         strings of tabs. If the -a option is given, then tabs are inserted
>         whenever they would compress the resultant file by replacing two or
>         more characters.
>
> which is, at least, clearer.
> (That version has nothing like GNU's
> --first-only, or at least the manpage doesn't.)
>
> > [...], and the "--first-only" option serves the same purpose as grep
> > -G (None at all, [...])
>
> Actually, it does; it can be specified to get the default behaviour
> when the opposing option might have been specified already. For
> example, if I have a wrapper script (let's call it "unex")
>
>     #! /bin/sh
>     # "--" keeps sh from parsing a leading -a in $UNEX_OPTIONS as an
>     # option to "set" itself.
>     set -- $UNEX_OPTIONS "$@"
>     unexpand "$@"
>
> then I can run "unex --first-only" to get the default behaviour
> regardless of whether -a is present in $UNEX_OPTIONS.

- Oliver Webb <aquahobby...@proton.me>
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net