Hi!

On Fri, Jun 10, 2022 at 05:52:39PM +0200, Geoff Clare wrote:
> наб wrote, on 09 Jun 2022:
> > The question therefore becomes: which interpretation is right?
> > Given printf '\uFEFF \ta\n \ta\n\t a' | unexpand,
>
> I think this can be split into two separate issues:
>
> 1. How the text applies to blank characters other than <space> and <tab>
>
> 2. How to interpret the text when the only blank characters are
>    <space> and <tab>.
>
> I think the key to issue 2 is that when converting a sequence of <space>
> and <tab> characters into the maximum number of <tab>s followed by the
> minimum number of <space>s, each <tab> in the input can be considered to
> be unchanged in the output. It is only the <space>s that actually need
> to undergo a conversion (into zero or more <tab>s).
This runs counter to my intuitive reading, esp. for -a (convert 2-or-more
blanks & convert only spaces => may not convert space+tab), but does make
sense when put that way, yeah.
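To make that reading concrete for myself, here is a minimal Python sketch
of the leading-blank case only (no -a). It assumes tab stops every 8
columns and treats only <space> and <tab> as blanks; unexpand_leading and
its tabstop parameter are names made up for illustration, not taken from
any implementation.

```python
def unexpand_leading(line, tabstop=8):
    """Rewrite the leading run of spaces/tabs as max tabs + min spaces.

    Assumption: only ' ' and '\t' count as blanks, tab stops every
    `tabstop` columns. This is a sketch, not a full unexpand.
    """
    col = 0
    i = 0
    # Work out which column the leading run of blanks ends at.
    while i < len(line) and line[i] in " \t":
        if line[i] == "\t":
            col += tabstop - (col % tabstop)  # advance to the next tab stop
        else:
            col += 1
        i += 1
    # Reaching column `col` from column 0 needs this many tabs and spaces.
    tabs, spaces = divmod(col, tabstop)
    return "\t" * tabs + " " * spaces + line[i:]


if __name__ == "__main__":
    for sample in (" \ta", "\t" + " " * 9 + "a", "\u2002 \ta"):
        print(repr(sample), "->", repr(unexpand_leading(sample)))
```

Under these assumptions a space followed by a tab collapses to a single
tab, and a line starting with some other blank such as \u2002 is left
alone, which is only one of the possible answers to issue 1.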
> For issue 1, it seems to me that parts of the text were written with the
> assumption that <space> and <tab> are the only blank characters, and it
> is unclear whether other blank characters are supposed to be converted
> into (zero or more) <tab>s.
>
> Your question at the end seems to assume that \uFEFF (ZERO WIDTH NO-BREAK
> SPACE) is a blank character in the current locale. Do you know of a
> locale where it is? It isn't in en_GB.utf8 on Linux or en_GB.UTF-8 on
> Solaris. So with those locales the answer is that the first line of input
> is unchanged in the output.
Mistake on my part, copied the wrong run; I meant \u2009 (THIN SPACE),
which in this case is equivalent to your \u2002 example.
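As a rough cross-check of how these characters differ, Python's
unicodedata module reports the Unicode general category. This is only a
proxy: whether a character is blank is decided by the locale (what
iswblank() returns there), not by the Unicode category, so treat it as a
sketch. general_category is a helper name made up for illustration.

```python
import unicodedata


def general_category(ch):
    # Unicode general category, e.g. "Zs" (space separator) or "Cf" (format).
    # Assumption: this is used only as a hint, not as the locale's blank class.
    return unicodedata.category(ch)


if __name__ == "__main__":
    for ch in ("\uFEFF", "\u2002", "\u2009"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {general_category(ch)}")
```

U+FEFF is a format character (Cf), while U+2002 and U+2009 are both space
separators (Zs).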
> The answer would be different ("the standard
> is unclear") for another character such as \u2002 (EN SPACE) which is
> blank in en_GB.utf8 on Linux (but not in en_GB.UTF-8 on Solaris).
Hm, as far as precedent goes, of the two extant implementations that
actually use characters instead of bytes (FreeBSD, Illumos),
neither appears to check iswblank().
> Your example is also missing a trailing newline, which means the input is
> not a text file and the behaviour is unspecified. If you add the missing
> newline, the 2nd and 3rd lines in your example should be converted to
> "\ta\n\t\t a\n", and all the implementations I tried did that.
Thanks for your help,
наб
