Stephane Chazelas <stephane.chaze...@gmail.com> wrote:
 |2017-05-17 09:37:31 +0100, Geoff Clare:
 |[...]
 |> That would appear to be a bug in the standard, as it doesn't match
 |> existing practice in any of the shells I tried (with a UTF-8 locale):
 |> 
 |> $ printf 'echo\u00a0foo\n' | grep '[[:blank:]]'
 |> echo foo
 |> $ printf 'echo\u00a0foo\n' | sh                
 |> sh: echo�:  not found
 |> $ printf 'echo\u00a0foo\n' | ksh
 |> ksh[1]: echo foo: not found [No such file or directory]
 |> $ printf 'echo\u00a0foo\n' | bash
 |> bash: line 1: echo foo: command not found
 |> $ printf 'echo\u00a0foo\n' | POSIXLY_CORRECT=1 bash
 |> bash: line 1: echo foo: command not found
 |> 
 |> (This was on Solaris 11: "sh" is ksh88 and "ksh" is ksh93.)
 |> 
 |> Judging by the ksh88 error message it looks like it treated the a0
 |> byte of the Unicode NO-BREAK SPACE as a delimiter, so it might use
 |> all <blank> characters as delimiters in a single-byte locale, but
 |> not doing it for multibyte characters means it doesn't behave as
 |> described in the standard.
 |[...]
 |
 |bash has a similar issue. It treats [:blank:] as delimiters
 |only in locales with single-byte charsets. See
 |https://lists.gnu.org/archive/html/bug-bash/2014-10/msg00098.html
 |and the whole discussion there.
 |
 |I wasn't aware that ksh88 also honoured the locales for the
 |shell syntax tokenisation.
 |
 |Note that the issue is not only about the shell language. See
 |also awk (syntax, field parsing, string numerification...), bc,
 |xargs, m4...
 |
 |BTW, U+00A0, should really not be a [:blank:] or [:space:].
 |That's the whole point of that "non-breaking space" character.

It is clearly defined as whitespace in Unicode and thus ISO, at
least as of when i last worked with the Unicode tables.  Not that
my MUA gets that right at the moment, but in theory it should.
(We still use a homegrown byte table for that, which was designed
to deal with email RFCs, but i hope to find some time soon for my
ctext Unicode thing again, and in a distant future that MUA will
get this right, too.)
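
A rough sketch for checking this on a given system, not taken from
the thread: it dumps the UTF-8 encoding of U+00A0 (the bytes 0xC2
0xA0, which is where the a0 byte in the ksh88 message comes from)
and asks the current locale, via grep, whether it puts the character
into [:blank:] or [:space:].  The variable names are made up for
illustration, and no particular answer is assumed for the class
tests, since that depends on the locale in use.

```sh
# U+00A0 NO-BREAK SPACE, written as its two UTF-8 bytes in octal.
nbsp=$(printf '\302\240')

# Show the raw bytes; the second one is the a0 discussed above.
bytes=$(printf '%s' "$nbsp" | od -An -tx1 | tr -d ' \n')
echo "UTF-8 bytes of U+00A0: $bytes"

# Ask the current locale how it classifies the character.
for class in blank space; do
    if printf 'a%sb\n' "$nbsp" | grep -q "[[:$class:]]"; then
        echo "[:$class:] matches U+00A0 in this locale"
    else
        echo "[:$class:] does not match U+00A0 in this locale"
    fi
done
```

The class results differ between systems: the grep run quoted above
matched on Solaris 11 in a UTF-8 locale, while other C libraries may
answer differently.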

 |That's a known oddity of Solaris. (that makes it the only
 |single-byte blank I'm aware of, though of course one may always
 |construct a rogue locale that has more).

The only one besides U+0020 SP and U+0009 HT.

--steffen
|
|Ralph says i must not use signatures which spread the light!
