Ashley Moran <[email protected]> writes:
> I was thinking that any non-whitespace character *at all* would have
> to be preserved, so:
>
> /([\s]*)(([^\s]+)($|[\s]+))+/

To know reliably which characters are whitespace, you must first know
the file's encoding. For example, treating the ASCII whitespace bytes
[\r\t\v\n ] as whitespace will not work on a UTF-16 encoded file: every
code unit there is two bytes, so a space is 00 20 (or 20 00), and a 20
byte can just as well be half of an unrelated character such as U+2020
DAGGER. Even in UTF-8, there are more whitespace code points than those
in the ASCII range. The Zs (Separator, Space) category alone includes:
000020 SPACE
0000a0 NO-BREAK SPACE
001680 OGHAM SPACE MARK
00180e MONGOLIAN VOWEL SEPARATOR (Zs until Unicode 6.3; now Cf)
002000 EN QUAD
002001 EM QUAD
002002 EN SPACE
002003 EM SPACE
002004 THREE-PER-EM SPACE
002005 FOUR-PER-EM SPACE
002006 SIX-PER-EM SPACE
002007 FIGURE SPACE
002008 PUNCTUATION SPACE
002009 THIN SPACE
00200a HAIR SPACE
00202f NARROW NO-BREAK SPACE
00205f MEDIUM MATHEMATICAL SPACE
003000 IDEOGRAPHIC SPACE
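
Here is a rough sketch of both points in Python. The language is my
choice, not something from this thread, and the helper names
has_ascii_ws_byte and is_space_separator are made up for illustration.
It assumes the standard unicodedata module, whose answers depend on the
Unicode version your Python was built with.

```python
import unicodedata

# The byte-level assumption being questioned: these bytes are "whitespace".
ASCII_WS_BYTES = b" \t\n\v\r"


def has_ascii_ws_byte(data: bytes) -> bool:
    """Return True if any byte in data is one of the ASCII whitespace bytes."""
    return any(b in ASCII_WS_BYTES for b in data)


def is_space_separator(ch: str) -> bool:
    """Return True if the code point is in general category Zs."""
    return unicodedata.category(ch) == "Zs"


# U+2020 DAGGER is not whitespace, yet in UTF-16 it is the two bytes 20 20,
# so a byte-level scan mistakes it for two spaces.
dagger_utf16 = "\u2020".encode("utf-16-le")
byte_scan_fooled = has_ascii_ws_byte(dagger_utf16)

# After decoding, the category lookup gives the right answer per code point.
dagger_is_space = is_space_separator("\u2020")
nbsp_is_space = is_space_separator("\u00a0")
```

Note that Zs by itself does not cover tab or newline (those are category
Cc), so a real check would have to combine it with the control
characters you care about.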