Stephane Chazelas <stephane.chaze...@gmail.com> wrote:
 |2017-05-17 14:00:21 +0200, Steffen Nurpmeso:
 |[...]
 |>|BTW, U+00A0, should really not be a [:blank:] or [:space:].
 |>|That's the whole point of that "non-breaking space" character.
 |> 
 |> It is clearly defined as whitespace in Unicode and thus ISO, at
 |> least when i last worked with the Unicode tables.  Not that my MUA
 |> gets that right at the moment, but in theory it should.  (We still
 |> use a homegrown byte table for that, which was designed to deal
 |> with email RFCs, but i hope i soon find some time for my ctext
 |> Unicode thing again, finally, and in a distant future that MUA
 |> will get this right, too.)
 |
 |It's whitespace but it's non-breaking as in it should *not* be
 |used for delimiters. So either blank/space should not have
 |U+00A0 or the POSIX spec should be updated to *not* refer to
 |"blank" when it specifies delimiting behaviour IMO.

Well, i disagree a little bit here.  I would think of this
character as an indicator that breaking should not occur, rather
than anything else.  As in: if there is any other opportunity, do
not break here.  It is meant to be used, for example, when writing
a name, or a title and a name, like "Dr. Bob", which looks really
ugly when split across lines.
What this has to do with a shell command line, on the other hand,
i do not know.
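
To make the difference between "is whitespace" and "is a delimiter"
concrete, here is a small sketch in Python.  Python is only a
stand-in: the shell's field splitting depends on IFS and on the
locale's blank class, which this does not model, and the sample
line is made up.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# Hypothetical input: the name is glued together with a no-break space.
line = f"echo Dr.{NBSP}Bob arrived"

# str.split() without arguments splits on every Unicode whitespace
# character, so U+00A0 acts as a delimiter and the name is torn apart.
unicode_fields = line.split()

# Splitting on ASCII space and tab only keeps "Dr.<NBSP>Bob" in one field.
ascii_fields = re.split(r"[ \t]+", line)

print(unicode_fields)
print(ascii_fields)
```

The first split treats the no-break space like any other blank, the
second leaves it alone, which is the behaviour the quoted text asks
for when it comes to delimiters.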

 |Now, http://www.unicode.org/L2/L2003/03139-posix-classes.htm
 |recommends nbsp be included in the POSIX "blank"/"space"
 |classes, so I suppose there are quite a few people that don't
 |agree with me on that (note that I don't object to U+00A0 being
 |considered a blank/space but to it being considered a
 |delimiter).

It includes it in graphical not space.  It is from 2003.  TR29
puts it into OLetter, but also allows it as whitespace in number
formatting, together with SP and more.
I do not..  Well, there is "pattern whitespace", which perl(1)
also uses for some purposes, it is, i copy this from perl, since
my own thing is lingering unfinished, sigh,

  U+0009 CHARACTER TABULATION
  U+000A LINE FEED
  U+000B LINE TABULATION
  U+000C FORM FEED
  U+000D CARRIAGE RETURN
  U+0020 SPACE
  U+0085 NEXT LINE
  U+200E LEFT-TO-RIGHT MARK
  U+200F RIGHT-TO-LEFT MARK
  U+2028 LINE SEPARATOR
  U+2029 PARAGRAPH SEPARATOR
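
The list above can be held against what a generic Unicode
whitespace test says.  The following is only a sketch in Python,
not perl(1): str.isspace() and the \s of Python's re module are
stand-ins, and their rules need not agree with Perl's in every
detail.

```python
import re

# The pattern whitespace code points from the list above.
PATTERN_WHITE_SPACE = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
}

def compare(cp):
    """Return (in pattern whitespace, str.isspace(), matched by \\s)."""
    ch = chr(cp)
    return (
        cp in PATTERN_WHITE_SPACE,
        ch.isspace(),
        re.fullmatch(r"\s", ch) is not None,
    )

for cp in (0x0020, 0x0085, 0x00A0, 0x200E, 0x2028):
    print(f"U+{cp:04X}", compare(cp))
```

In this sketch U+00A0 is whitespace for Python but not in the
pattern whitespace list, while for U+200E it is the other way
around.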

On the other hand, in a Unicode environment the \s regular
expression matches U+00A0 too.  And to me U+200{E,F} are control
characters, hmm.  My CText thing, which is a bit, well, unloved,
comes up with these characters for <blank>, mapped from space
separators:

      0x002000u, 0x00200Au, /* range: EN QUAD, EM QUAD, EN SPACE,
                             * EM SPACE, THREE-PER-EM SPACE,
                             * FOUR-PER-EM SPACE, SIX-PER-EM SPACE,
                             * FIGURE SPACE, PUNCTUATION SPACE,
                             * THIN SPACE, HAIR SPACE */
      0x000009u, /* CHARACTER TABULATION */
      0x000020u, /* SPACE */
      0x0000A0u, /* NO-BREAK SPACE */
      0x001680u, /* OGHAM SPACE MARK */
      0x00202Fu, /* NARROW NO-BREAK SPACE */
      0x00205Fu, /* MEDIUM MATHEMATICAL SPACE */
      0x003000u, /* IDEOGRAPHIC SPACE */

This is as of Unicode 6.3.
I mean, somehow the data must be derivable, and if ISO 10646
a.k.a. Unicode chooses and uses these semantics, then i would
(rather) go with them, they actually run on anything in the world.
You can always say LC_ALL=C or something, to get plain ASCII.
In my opinion it is up to a higher level parser to add decisions
based upon such code-points (following, e.g., Unicode
text-segmentation rules), the shell has its own quoting
mechanisms.
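
As for "derivable": the <blank> table above can be rebuilt from
the Unicode character database.  A minimal sketch in Python,
assuming that <blank> means "general category Zs (space separator)
plus TAB"; that rule is my reading of the table, and the result
depends on the Unicode version which the Python in use ships.

```python
import sys
import unicodedata

def derive_blank():
    """Code points with general category Zs, plus U+0009 (assumed rule)."""
    blanks = {0x0009}
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == "Zs":
            blanks.add(cp)
    return sorted(blanks)

for cp in derive_blank():
    # Control characters have no name in the database, hence the default.
    name = unicodedata.name(chr(cp), "CHARACTER TABULATION")
    print(f"U+{cp:04X} {name}")
```

With a database of Unicode 6.3 or later this should list the same
code points as the table.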

 |See also
 |http://www.unicode.org/L2/L2003/03139-posix-classes.htm#TR_14652
 |about ISO/IEC TR 14652:2004 including BS (backspace) and not
 |nbsp.
 |
 |What's the opengroup position?
 |
 |>|That's a known oddity of Solaris. (that makes it the only
 |>|single-byte blank I'm aware of, though of course one may always
 |>|construct a rogue locale that has more).
 |> 
 |> The only one besides U+0020 SP and U+0009 HT.
 |[...]
 |
 |Yes, of course, that's what I meant. The only non-ASCII single
 |byte character (0xa0 in many charsets, 0x9a in KOI8-U) that can
 |cause problems in practice with bash (or ksh88 it appears).

I have forgotten U+0085, sorry.
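
The byte values quoted above can be checked against the codec
tables that Python ships.  This is only a sketch: the selection of
charsets is mine, the codec names are Python's, and none of it
says anything about how a particular libc locale classifies these
bytes.

```python
# Byte that is supposed to decode to U+00A0 NO-BREAK SPACE, per codec.
SINGLE_BYTE_NBSP = {
    "iso8859_1": 0xA0,
    "iso8859_15": 0xA0,
    "cp1252": 0xA0,
    "koi8_u": 0x9A,
}

def decodes_to_nbsp(codec, byte):
    """True if the single byte decodes to U+00A0 in the given codec."""
    return bytes([byte]).decode(codec) == "\u00a0"

for codec, byte in SINGLE_BYTE_NBSP.items():
    print(codec, hex(byte), decodes_to_nbsp(codec, byte))

# U+0085 NEXT LINE is likewise a single byte (0x85) in ISO 8859-1.
print(bytes([0x85]).decode("iso8859_1") == "\u0085")
```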

--steffen
|
|Ralph says i must not use signatures which spread the light!
