Ashley Moran <[email protected]> writes:

> I was thinking that any non-whitespace character *at all* would have
> to be preserved, so:
>
>     /([\s]*)(([^\s]+)($|[\s]+))+/

To know reliably which characters are whitespace, you must know the
file's encoding.  For example, treating the ASCII whitespace bytes
[\r\t\v\n ] as whitespace will not work with a UTF-16 encoded file,
where every character occupies at least two bytes.
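
To make the UTF-16 problem concrete, here is a rough Python sketch.
The helper names (looks_blank_bytes, is_blank_decoded) are made up for
illustration and are not from darcs or any library.  It compares a
byte-level test against a test on the decoded text:

```python
ASCII_WS = b"\r\t\v\n "

def looks_blank_bytes(raw: bytes) -> bool:
    """Naive test: every byte is one of the ASCII whitespace bytes."""
    return all(b in ASCII_WS for b in raw)

def is_blank_decoded(raw: bytes, encoding: str) -> bool:
    """Decode first, then ask whether every character is whitespace."""
    return raw.decode(encoding).isspace()

# U+2020 DAGGER is not whitespace, but in UTF-16-LE it is the two
# bytes 0x20 0x20, which look like two ASCII spaces.
dagger = "\u2020".encode("utf-16-le")

# A real SPACE in UTF-16-LE is 0x20 0x00, and the 0x00 byte makes the
# naive test reject it.
space = " ".encode("utf-16-le")

if __name__ == "__main__":
    print(looks_blank_bytes(dagger), is_blank_decoded(dagger, "utf-16-le"))
    print(looks_blank_bytes(space), is_blank_decoded(space, "utf-16-le"))
```

So the byte-level test goes wrong in both directions: it takes a
dagger for whitespace and fails to recognise an actual space.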

Even in UTF-8, there are more whitespace code points than those in the
ASCII range.  The Unicode Zs (Separator, Space) category alone
includes:

    000020 SPACE
    0000a0 NO-BREAK SPACE
    001680 OGHAM SPACE MARK
    00180e MONGOLIAN VOWEL SEPARATOR
    002000 EN QUAD
    002001 EM QUAD
    002002 EN SPACE
    002003 EM SPACE
    002004 THREE-PER-EM SPACE
    002005 FOUR-PER-EM SPACE
    002006 SIX-PER-EM SPACE
    002007 FIGURE SPACE
    002008 PUNCTUATION SPACE
    002009 THIN SPACE
    00200a HAIR SPACE
    00202f NARROW NO-BREAK SPACE
    00205f MEDIUM MATHEMATICAL SPACE
    003000 IDEOGRAPHIC SPACE
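
A list like the one above can be regenerated from Python's standard
unicodedata module.  This is only a sketch, and space_separators is a
name I made up.  The result depends on the Unicode version your
interpreter ships with: since Unicode 6.3, for instance, U+180E
MONGOLIAN VOWEL SEPARATOR is classified as Cf rather than Zs, so a
recent Python will not list it.

```python
import sys
import unicodedata

def space_separators() -> list[int]:
    """Return every code point whose general category is Zs."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]

if __name__ == "__main__":
    for cp in space_separators():
        # Same layout as the list above: six hex digits, then the name.
        print(f"{cp:06x} {unicodedata.name(chr(cp))}")
```

Note that Zs is not the whole story either: TAB and LINE FEED are
control characters (category Cc), and U+2028 LINE SEPARATOR and U+2029
PARAGRAPH SEPARATOR have categories of their own (Zl and Zp).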

_______________________________________________
darcs-users mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-users
