On Fri, Apr 15, 2005 at 11:44:03PM +0200, Juerd wrote:
: Is there a <?ws>-like thingy that is always \s+?

Not currently, since plain \s+ already does that.  <?ws> used to mean
\s+, but it is currently defined as the magical whitespace matcher used
by :words.

: Do \s and <?ws> match non-breaking whitespace, U+00A0?

Yes.

: How about:
: 
:     U+0008  backspace
:     U+00A0  no break space (Repeated for overview)
:     U+1361  ethiopic wordspace
:     U+2000  en quad
:     U+2001  em quad
:     U+2002  en space
:     U+2003  em space
:     U+2004  three per em space
:     U+2005  four per em space
:     U+2006  six per em space
:     U+2007  figure space
:     U+2008  punctuation space
:     U+2009  thin space 
:     U+200A  hair space
:     U+200B  zero width space
:     U+202F  narrow no break space
:     U+205F  medium mathematical space
:     U+2060  word joiner (What is that, anyway?)
:     U+3000  ideographic space
:     U+FEFF  zero width non-breaking space

Yes, any Unicode whitespace, but you seem to have a different list than
I do.  Outside of the standard ASCIIish control-character whitespace,
I count only the \pZ characters, not the \pC characters, so I don't have
to tell you what a word-joiner is, since it's a \p[Cf] character.  :-)
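
To make the \pZ versus \pC distinction concrete, here is a minimal
sketch in Python rather than Perl 6.  It relies on whatever Unicode
database the local Python ships, so a few characters may report a
different category than older UnicodeData.txt files do.  The names
CANDIDATES and is_separator are made up for illustration.

```python
import unicodedata

# A few of the code points from the question above (illustrative subset).
CANDIDATES = [0x0008, 0x00A0, 0x1361, 0x2002, 0x2060, 0x3000, 0xFEFF]

def is_separator(cp):
    """True if the code point's general category is Zs, Zl, or Zp."""
    return unicodedata.category(chr(cp)).startswith("Z")

for cp in CANDIDATES:
    ch = chr(cp)
    # Control characters have no name, so supply a fallback.
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{cp:04X} {unicodedata.category(ch)} {name}")
```

Under that test the word joiner (Cf), the BOM (Cf), and backspace (Cc)
all fall outside the separator set, while NBSP and the ideographic
space fall inside it.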

I will also gleefully ignore the existence of BOMs.

So I make it:

    0020;SPACE;Zs;0;WS;;;;;N;;;;;
    00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
    1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
    180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
    2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
    2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
    2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
    2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
    2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
    2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
    202F;NARROW NO-BREAK SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
    205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
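
A list of this kind can be regenerated mechanically.  The following is
only a sketch, in Python, that walks every code point and keeps those
in the Zs, Zl, and Zp categories.  The helper name
separator_code_points is hypothetical, and because Python bundles its
own (newer) Unicode data, the output is not guaranteed to match the
entries above one for one.

```python
import sys
import unicodedata

def separator_code_points():
    """Return every code point whose general category is Zs, Zl, or Zp."""
    return [
        cp for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in ("Zs", "Zl", "Zp")
    ]

for cp in separator_code_points():
    ch = chr(cp)
    # Print in a UnicodeData-like "code;name;category" layout.
    print(f"{cp:04X};{unicodedata.name(ch)};{unicodedata.category(ch)}")
```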

: \s is said (in S05) to match any unicode whitespace, but letting it
: match NBSP and then using \s for splitting things is wrong, I think.

Perhaps the default word split should not be based on \s, then.
That would be just one more way it differs from a plain \s split, in
addition to trimming leading and trailing whitespace the way awk does.

: Are the contents of <> split using <?ws>? (Is <<$foo>>, where $foo is
: "foo\xA0bar", one or two elements?)

That is using the default word splitter (or it *is* the default word
splitter), so if the default word split is based on <+[\s]-[\xA0]>
it would be one element.
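
As a rough illustration of what a splitter based on <+[\s]-[\xA0]>
would do, here is a sketch in Python, not Perl 6.  The function name
split_words is an assumption, and Python's \s stands in for Perl 6's.
The class [^\S\xa0] reads as "whitespace, except NBSP".

```python
import re

# Whitespace except NO-BREAK SPACE: "not (non-space or NBSP)".
_SPLIT = re.compile(r"[^\S\xa0]+")

def split_words(text):
    """Split on runs of whitespace other than U+00A0.

    Empty fields from leading or trailing whitespace are dropped,
    which gives the awk-like trimming.
    """
    return [word for word in _SPLIT.split(text) if word]

print(split_words("foo\xa0bar"))       # NBSP does not split
print(split_words("  foo bar\tbaz\n"))  # ordinary whitespace does
```

With this definition "foo\xA0bar" stays a single element, whereas
Python's own str.split() with no arguments breaks it in two.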

Of course, the ZERO WIDTH SPACE is a nasty critter for anyone using
whitespace to separate tokens.  That, and maybe the thin spaces too,
probably merit warnings in Perl code where they might cause visual
ambiguity.
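
A warning pass along those lines might look something like the sketch
below.  It is written in Python, and both the SUSPECT watch list and
the function name find_suspect_spaces are assumptions for the sake of
the example, not anything a Perl compiler is known to do.

```python
import unicodedata

# Hypothetical watch list: zero-width, thin, and hair spaces.
SUSPECT = {"\u200b", "\u2009", "\u200a"}

def find_suspect_spaces(source):
    """Return (line, column, name) for each suspect character, 1-based."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECT:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits

for lineno, col, name in find_suspect_spaces("my $x\u200b= 1;\nsay\u2009$x;"):
    print(f"line {lineno}, column {col}: {name}")
```

Which characters belong on the watch list is a judgment call; the set
above only covers the ones mentioned here plus the hair space.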

Larry
