Stephane Chazelas <stephane.chaze...@gmail.com> wrote:
|2017-05-17 09:37:31 +0100, Geoff Clare:
|[...]
|> That would appear to be a bug in the standard, as it doesn't match
|> existing practice in any of the shells I tried (with a UTF-8 locale):
|>
|> $ printf 'echo\u00a0foo\n' | grep '[[:blank:]]'
|> echo foo
|> $ printf 'echo\u00a0foo\n' | sh
|> sh: echo�: not found
|> $ printf 'echo\u00a0foo\n' | ksh
|> ksh[1]: echo foo: not found [No such file or directory]
|> $ printf 'echo\u00a0foo\n' | bash
|> bash: line 1: echo foo: command not found
|> $ printf 'echo\u00a0foo\n' | POSIXLY_CORRECT=1 bash
|> bash: line 1: echo foo: command not found
|>
|> (This was on Solaris 11: "sh" is ksh88 and "ksh" is ksh93.)
|>
|> Judging by the ksh88 error message it looks like it treated the a0
|> byte of the Unicode NO-BREAK SPACE as a delimiter, so it might use
|> all <blank> characters as delimiters in a single-byte locale, but
|> not doing it for multibyte characters means it doesn't behave as
|> described in the standard.
|[...]
|
|bash has a similar issue. It treats [:blank:] as delimiters
|only in locales with single-byte charsets. See
|https://lists.gnu.org/archive/html/bug-bash/2014-10/msg00098.html
|and the whole discussion there.
|
|I wasn't aware that ksh88 also honoured the locales for the
|shell syntax tokenisation.
|
|Note that the issue is not only about the shell language. See
|also awk (syntax, field parsing, string numerification...), bc,
|xargs, m4...
|
|BTW, U+00A0, should really not be a [:blank:] or [:space:].
|That's the whole point of that "non-breaking space" character.
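
On Geoff's reading of the ksh88 message quoted above: the effect can be reproduced at the byte level. The following is only a sketch of what a byte-oriented tokenizer would do if it took 0xA0 as a delimiter; `split_on_byte` is a made-up helper for illustration, not anything from a real shell.

```python
# Sketch: how a byte-oriented tokenizer would see U+00A0 in UTF-8 input.
# split_on_byte is a hypothetical helper, not taken from any shell's source.

NBSP = "\u00a0"


def split_on_byte(line, delim):
    """Split a byte string at every occurrence of one delimiter byte."""
    return line.split(bytes([delim]))


# In UTF-8 the NO-BREAK SPACE is encoded as the two bytes c2 a0.
line = ("echo" + NBSP + "foo").encode("utf-8")
print(line)

# Treating only the trailing a0 byte as a delimiter leaves the lead
# byte c2 glued to the command name.
words = split_on_byte(line, 0xA0)
print(words)

# Decoding the first word shows the stray c2 as a replacement character,
# similar in shape to the "echo?" seen in the quoted sh output.
print(words[0].decode("utf-8", errors="replace"))
```

So the command name ends in a lone c2 byte, which is not valid UTF-8 on its own and is what a terminal would show as a replacement character.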

It is clearly defined as whitespace in Unicode and thus ISO, at
least when i last worked with the Unicode tables. Not that my MUA
gets that right at the moment, but in theory it should. (We still
use a homegrown byte table for that, which was designed to deal
with email RFCs, but i hope i will soon find some time for my ctext
Unicode thing again, finally, and in a distant future that MUA
will get this right, too.)

|That's a known oddity of Solaris. (that makes it the only
|single-byte blank I'm aware of, though of course one may always
|construct a rogue locale that has more).

The only one besides U+0020 SP and U+0009 HT.

--steffen
|
|Ralph says i must not use signatures which spread the light!
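
P.S. For anyone who wants to check the Unicode side of this, here is a minimal sketch using Python's standard `unicodedata` module. It only reports what the Unicode character database bundled with the interpreter says; it says nothing about what a given system locale puts into [:blank:] or [:space:].

```python
import unicodedata

NBSP = "\u00a0"

# Name, general category and compatibility decomposition of U+00A0.
print(unicodedata.name(NBSP))
print(unicodedata.category(NBSP))       # "Zs" is the space separator category
print(unicodedata.decomposition(NBSP))  # noBreak variant of U+0020

# Python's own notion of whitespace, which is derived from Unicode data.
print(NBSP.isspace())

# Code points below U+0100 in the space separator category. HT (U+0009)
# does not appear here because it is a control character (category Cc).
latin1_zs = [hex(cp) for cp in range(0x100)
             if unicodedata.category(chr(cp)) == "Zs"]
print(latin1_zs)
```

The decomposition line is the interesting one: the database describes U+00A0 as a no-break variant of the ordinary space, which fits both positions taken above (it is a space, and its point is not to break).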