On Fri, Apr 15, 2005 at 11:44:03PM +0200, Juerd wrote:
: Is there a <?ws>-like thingy that is always \s+?

Not currently, since \s+ is there.  <?ws> used to be that, but currently
is defined as the magical whitespace matcher used by :words.

: Do \s and <?ws> match non-breaking whitespace, U+00A0?

Yes.

: How about:
:
:     U+0008 backspace
:     U+00A0 no break space (Repeated for overview)
:     U+1361 ethiopic wordspace
:     U+2000 en quad
:     U+2001 em quad
:     U+2002 en space
:     U+2003 em space
:     U+2004 three per em space
:     U+2005 four per em space
:     U+2006 six per em space
:     U+2007 figure space
:     U+2008 punctuation space
:     U+2009 thin space
:     U+200A hair space
:     U+200B zero width space
:     U+202F narrow no break space
:     U+205F medium mathematic space
:     U+2060 word joiner (What is that, anyway?)
:     U+3000 ideographic space
:     U+FEFF zero width non-breaking space

Yes, any Unicode whitespace, but you seem to have a different list than
I do.  Outside of the standard ASCIIish control-character whitespace, I
count only the \pZ characters, not the \pC characters, so I don't have
to tell you what a word-joiner is, since it's a \p[Cf] character.  :-)
I will also gleefully ignore the existence of BOMs.

So I make it:

    0020;SPACE;Zs;0;WS;;;;;N;;;;;
    00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
    1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
    180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
    2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
    2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
    2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
    2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
    2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
    2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
    202F;NARROW NO-BREAK SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
    205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
    3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

: \s is said (in S05) to match any unicode whitespace, but letting it
: match NBSP and then using \s for splitting things is wrong, I think.

Perhaps the default word split should not be based on \s then.  It's just
one more difference, in addition to trimming leading and trailing
whitespace like awk.

: Are the contents of <> split using <?ws>? (Is <<$foo>>, where $foo is
: "foo\xA0bar", one or two elements?)

That is using the default word splitter (or it *is* the default word
splitter), so if the default word split is based on <+[\s]-[\xA0]> it
would be one element.

Of course, the ZERO WIDTH SPACE is a nasty critter for anyone using
whitespace to separate tokens.  That and maybe thin spaces probably merit
warnings in Perl code where they might cause visual ambiguity.

Larry
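
For anyone who wants to poke at the two ideas above (counting only the
\pZ characters, and a word split along the lines of <+[\s]-[\xA0]>),
here is a minimal sketch in Python rather than Perl 6.  It is only an
illustration, not how any Perl 6 implementation does it.  The names
is_separator, WORD_SEP and split_words are made up for this example.
Note that Python's unicodedata reflects whatever Unicode version ships
with the interpreter, so a few characters in the table above (ZERO WIDTH
SPACE and MONGOLIAN VOWEL SEPARATOR, for instance) may be classified
differently there.

```python
import re
import unicodedata


def is_separator(ch):
    """True if ch has a Unicode Z (separator) category: Zs, Zl or Zp.

    This mirrors the "\\pZ but not \\pC" rule: control and format
    characters such as WORD JOINER (Cf) are not counted.
    """
    return unicodedata.category(ch).startswith("Z")


# Assumed approximation of <+[\s]-[\xA0]>: any whitespace except
# NO-BREAK SPACE.  [^\S\xa0] reads as "not (non-whitespace or U+00A0)".
# For str patterns, Python's \s already matches Unicode whitespace,
# including U+00A0, which is why it has to be subtracted explicitly.
WORD_SEP = re.compile(r"[^\S\xa0]+")


def split_words(text):
    """Split on whitespace other than U+00A0.

    Empty leading and trailing fields are dropped, so ordinary leading
    and trailing whitespace is trimmed, awk-style.
    """
    return [word for word in WORD_SEP.split(text) if word]


if __name__ == "__main__":
    print(split_words("foo\xa0bar"))        # a single element
    print(split_words("  foo bar\tbaz\n"))  # three elements
    print(is_separator("\u2060"))           # WORD JOINER is Cf, not Z
```

With this splitter "foo\xA0bar" stays one element, while an ordinary
space, a tab, or an EM SPACE still separates words.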