On Sun, 2007-07-22 at 22:33 +0200, Peter Kleiweg wrote:
> >>> import re
> >>> s = u'a b\u00A0c d'
> >>> s.split()
> [u'a', u'b', u'c', u'd']
> >>> re.findall(r'\S+', s)
> [u'a', u'b\xa0c', u'd']
>
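If you want the Unicode interpretation of \S+, etc., pass the
re.UNICODE flag. By default, \s and \S only consider ASCII
whitespace, so u'\xa0' (NO-BREAK SPACE) counts as a non-space
character:
>>> re.findall(r'\S+', s, re.UNICODE)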
[u'a', u'b', u'c', u'd']
See http://docs.python.org/lib/node46.html
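If you need this in more than one place, it may be worth precompiling the pattern. The following is only a sketch: the helper name unicode_words is my own invention, not something in the re module. On Python 3, str patterns are already Unicode-aware, so the flag is redundant there but harmless.

```python
import re

# Hypothetical helper (not part of the standard library): the pattern is
# compiled once with re.UNICODE so that \S treats every Unicode whitespace
# character, including U+00A0 NO-BREAK SPACE, as a separator.
_NON_SPACE = re.compile(r'\S+', re.UNICODE)

def unicode_words(text):
    """Return the runs of non-whitespace characters found in text."""
    return _NON_SPACE.findall(text)

s = u'a b\u00A0c d'
words = unicode_words(s)

# For this input the result matches what s.split() gives on a unicode string.
print(words == s.split())
```

For the sample string above this gives the same four items as s.split().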
>
> This isn't documented either:
>
> >>> s = ' b c '
> >>> s.split()
> ['b', 'c']
> >>> s.split(' ')
> ['', 'b', 'c', '']
I believe the following documents it accurately:
http://docs.python.org/lib/string-methods.html

    If sep is not specified or is None, a different splitting
    algorithm is applied. First, whitespace characters (spaces,
    tabs, newlines, returns, and formfeeds) are stripped from both
    ends. Then, words are separated by arbitrary length strings of
    whitespace characters. Consecutive whitespace delimiters are
    treated as a single delimiter ("'1 2 3'.split()" returns "['1',
    '2', '3']"). Splitting an empty string or a string consisting of
    just whitespace returns an empty list.
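
To make the two algorithms concrete, here is a short sketch that uses nothing but str.split; the variable names are mine and purely illustrative.

```python
s = ' b c '

# No separator: leading and trailing whitespace is stripped first, and
# runs of whitespace count as a single delimiter.
no_sep = s.split()            # ['b', 'c']

# Explicit separator: every single ' ' delimits a field, so the leading
# and trailing spaces produce empty strings at both ends.
space_sep = s.split(' ')      # ['', 'b', 'c', '']

# The same contrast shows up with an empty string.
empty_no_sep = ''.split()     # []
empty_space_sep = ''.split(' ')   # ['']

print(no_sep, space_sep, empty_no_sep, empty_space_sep)
```

So the empty strings you see from s.split(' ') are the "fields" before the first space and after the last one.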
--
http://mail.python.org/mailman/listinfo/python-list