Re: nbsp in \s, and <>

Larry Wall Fri, 15 Apr 2005 15:39:07 -0700

On Fri, Apr 15, 2005 at 11:44:03PM +0200, Juerd wrote:
: Is there a <?ws>-like thingy that is always \s+?


Not currently, since \s+ already does that.  <?ws> used to mean that,
but it is now defined as the magical whitespace matcher used by :words.

: Do \s and <?ws> match non-breaking whitespace, U+00A0?

Yes.

: How about:
: 
:     U+0008  backspace
:     U+00A0  no break space (Repeated for overview)
:     U+1361  ethiopic wordspace
:     U+2000  en quad
:     U+2001  em quad
:     U+2002  en space
:     U+2003  em space
:     U+2004  three per em space
:     U+2005  four per em space
:     U+2006  six per em space
:     U+2007  figure space
:     U+2008  punctuation space
:     U+2009  thin space 
:     U+200A  hair space
:     U+200B  zero width space
:     U+202F  narrow no break space
:     U+205F  medium mathematical space
:     U+2060  word joiner (What is that, anyway?)
:     U+3000  ideographic space
:     U+FEFF  zero width non-breaking space

Yes, any Unicode whitespace, but you seem to have a different list than
I do.  Outside of the standard ASCIIish control-character whitespace,
I count only the \pZ characters, not the \pC characters, so I don't have
to tell you what a word-joiner is, since it's a \p[Cf] character.  :-)

I will also gleefully ignore the existence of BOMs.

So I make it:

    0020;SPACE;Zs;0;WS;;;;;N;;;;;
    00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
    1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
    180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
    2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
    2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
    2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
    2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
    2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
    2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
    202F;NARROW NO-BREAK SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
    205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
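
A list like the one above can be regenerated by filtering on the general
category.  The following is only a minimal sketch, assuming Python's
standard unicodedata module as a stand-in for the \pZ test; the helper
name separator_chars is hypothetical.  The result depends on the Unicode
database bundled with the interpreter, and later Unicode versions have,
as far as I know, moved ZERO WIDTH SPACE and MONGOLIAN VOWEL SEPARATOR
into Cf, so the output may not match the list above line for line.

```python
import unicodedata

def separator_chars(limit=0x10000):
    """Return (code point, name, category) for each Z* character below limit.

    Hypothetical helper: it keeps only the separator categories
    (Zs, Zl, Zp) and skips control/format characters (Cc, Cf).
    """
    found = []
    for cp in range(limit):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat.startswith("Z"):
            found.append((cp, unicodedata.name(ch, "<unnamed>"), cat))
    return found

# Print in a shape loosely resembling the UnicodeData.txt lines above.
for cp, name, cat in separator_chars():
    print(f"{cp:04X};{name};{cat}")
```

WORD JOINER (U+2060), the BOM (U+FEFF) and backspace (U+0008) do not show
up here because their categories are Cf, Cf and Cc respectively.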

: \s is said (in S05) to match any Unicode whitespace, but letting it
: match NBSP and then using \s for splitting things is wrong, I think.

Perhaps the default word split should not be based on \s then.
It's just one more difference, in addition to trimming leading and
trailing whitespace like awk.

: Are the contents of <> split using <?ws>? (Is <<$foo>>, where $foo is
: "foo\xA0bar", one or two elements?)

That is using the default word splitter (or it *is* the default word
splitter), so if the default word split is based on <+[\s]-[\xA0]>
it would be one element.
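
To illustrate the effect of a splitter based on <+[\s]-[\xA0]>, here is
a rough Python analogue.  It is a sketch under assumptions, not the Perl 6
implementation: Python's \s (on str patterns) stands in for Unicode
whitespace, and split_words is a hypothetical name.  It also drops
leading and trailing whitespace, awk-style.

```python
import re

# "Whitespace minus NO-BREAK SPACE", written as a negated class:
# anything that is neither non-whitespace (\S) nor U+00A0.
WORD_BREAK = re.compile(r"[^\S\xa0]+")

def split_words(text):
    """Split on runs of whitespace other than U+00A0, ignoring empty edges."""
    return [word for word in WORD_BREAK.split(text) if word]

foo = "foo\xa0bar"
print(len(split_words(foo)))   # element count with NBSP excluded from the split
print(len(foo.split()))        # Python's default split treats NBSP as whitespace
```

With this splitter "foo\xA0bar" stays a single element, whereas a split on
plain \s (or Python's default str.split) yields two.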

Of course, the ZERO WIDTH SPACE is a nasty critter for anyone using
whitespace to separate tokens.  That and maybe thin spaces probably
merit warnings in Perl code where they might cause visual ambiguity.
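
A check of that kind could look roughly like the sketch below.  It is
purely illustrative: the set of suspect characters and the function name
ambiguous_spaces are assumptions, and nothing here reflects how a real
Perl compiler would issue such warnings.

```python
# Characters assumed, for this sketch, to be visually ambiguous in source.
SUSPECT = {
    0x200B: "ZERO WIDTH SPACE",
    0x2009: "THIN SPACE",
}

def ambiguous_spaces(source):
    """Return (line, column, name) for each suspect character, 1-based."""
    hits = []
    for lineno, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            name = SUSPECT.get(ord(ch))
            if name is not None:
                hits.append((lineno, col, name))
    return hits

sample = "my $x\u200b= 1;\nsay\u2009$x;"
for lineno, col, name in ambiguous_spaces(sample):
    print(f"line {lineno}, column {col}: {name} may cause visual ambiguity")
```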

Larry
