Hi!

On Fri, Jun 10, 2022 at 05:52:39PM +0200, Geoff Clare wrote:
> наб wrote, on 09 Jun 2022:
> > The question therefore becomes: which interpretation is right?
> > Given printf '\uFEFF \ta\n \ta\n\t         a' | unexpand,
> 
> I think this can be split into two separate issues:
> 
> 1. How the text applies to blank characters other than <space> and <tab>
> 
> 2. How to interpret the text when the only blank characters are
> <space> and <tab>.
> 
> I think the key to issue 2 is that when converting a sequence of <space>
> and <tab> characters into the maximum number of <tab>s followed by the
> minimum number of <space>s, each <tab> in the input can be considered to
> be unchanged in the output. It is only the <space>s that actually need
> to undergo a conversion (into zero or more <tab>s).
This runs counter to my intuitive reading, especially for -a
(if runs of 2-or-more blanks get converted, but only the spaces in them
 are actually converted, then space+tab may not be converted at all),
but it does make sense when put that way, yeah.

> For issue 1, it seems to me that parts of the text were written with the
> assumption that <space> and <tab> are the only blank characters, and it
> is unclear whether other blank characters are supposed to be converted
> into (zero or more) <tab>s.
> 
> Your question at the end seems to assume that \uFEFF (ZERO WIDTH NO-BREAK
> SPACE) is a blank character in the current locale. Do you know of a
> locale where it is?  It isn't in en_GB.utf8 on Linux or en_GB.UTF-8 on
> Solaris. So with those locales the answer is that the first line of input
> is unchanged in the output.
Mistake on my part, copied the wrong run; I meant \u2009 (THIN SPACE),
which in this case is equivalent to your \u2002 example.

> The answer would be different ("the standard
> is unclear") for another character such as \u2002 (EN SPACE) which is
> blank in en_GB.utf8 on Linux (but not in en_GB.UTF-8 on Solaris).

Hm, as far as precedent goes, of the two extant implementations that
actually use characters instead of bytes (FreeBSD, Illumos),
neither appears to check iswblank().

> Your example is also missing a trailing newline, which means the input is
> not a text file and the behaviour is unspecified.  If you add the missing
> newline, the 2nd and 3rd lines in your example should be converted to
> "\ta\n\t\t a\n", and all the implementations I tried did that.
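For reference, here is a minimal sketch of that reading of issue 2, where
only <space> and <tab> count as blanks and only the leading run is
converted. The function name unexpand_leading and the fixed tab stop of 8
are my assumptions for illustration; this is not taken from any of the
implementations mentioned, and it does not attempt -a or -t.

```python
def unexpand_leading(line: str, tabstop: int = 8) -> str:
    """Rewrite the leading <space>/<tab> run of one line as the maximum
    number of <tab>s followed by the minimum number of <space>s.

    Assumption: tab stops every `tabstop` columns, and only <space> and
    <tab> are treated as blank characters.
    """
    col = 0   # column reached after the leading blanks
    end = 0   # index of the first non-blank character
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            # A <tab> advances to the next tab stop.
            col += tabstop - col % tabstop
        else:
            break
        end += 1
    tabs, spaces = divmod(col, tabstop)
    return "\t" * tabs + " " * spaces + line[end:]


def unexpand_text(text: str, tabstop: int = 8) -> str:
    """Apply unexpand_leading to every line of a text file's contents."""
    return "".join(
        unexpand_leading(line, tabstop)
        for line in text.splitlines(keepends=True)
    )
```

Run on the 2nd and 3rd lines of the example (with the trailing newline
added), unexpand_text(" \ta\n\t         a\n") gives "\ta\n\t\t a\n",
which agrees with the output quoted above.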

Thanks for your help,
наб
