On 2007-08-07, Christoph Krammer <[EMAIL PROTECTED]> wrote:
> Hello everybody,
>
> I wanted to use re.sub to strip all HTML tags out of a given string. I
> learned that there are better ways to do this without the re module,
> but I would like to know why my code is not working. I use the
> following:
>
> def stripHtml(source):
>   source = re.sub("[\n\r\f]", " ", source)
>   source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
>   source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
>   return source
>
> But the result still has some tags in it. When I call the
> second line multiple times, all tags disappear, but since HTML
> tags cannot be overlapping, I do not understand this behavior.
> There is even a difference when I omit the re.I (IGNORECASE)
> option. Without this option, some tags containing only capital
> letters (like </FONT>) were kept in the string when doing one
> processing run but removed when doing multiple runs.
>
> Perhaps anyone can tell me why this regex is behaving like
> this.

>>> import re
>>> help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a callable, it's passed the match object and must return
    a replacement string to be used.

And from the Python Library Reference for re.sub:

  The pattern may be a string or an RE object; if you need to
  specify regular expression flags, you must use a RE object, or
  use embedded modifiers in a pattern; for example,
  "sub("(?i)b+", "x", "bbbb BBBB")" returns 'x x'.

  The optional argument count is the maximum number of pattern
  occurrences to be replaced; count must be a non-negative
  integer. If omitted or zero, all occurrences will be replaced.
  Empty matches for the pattern are replaced only when not
  adjacent to a previous match, so "sub('x*', '-', 'abc')"
  returns '-a-b-c-'.

In other words, the fourth argument to sub is count, not a set of
re flags.

-- 
Neil Cerutti
-- 
http://mail.python.org/mailman/listinfo/python-list
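
To make the count-versus-flags point concrete, here is a minimal sketch of one way to rewrite the function with compiled RE objects, as the Library Reference suggests. The names TAG_RE, ENTITY_RE, strip_html, and sample are made up for illustration and are not from the original post. The sketch also shows that re.S | re.I | re.M is simply the integer 26, so in the original call at most 26 tags were removed per run, which would explain why repeated runs eventually cleared everything. Dropping re.I changes that number to 24, which is presumably why the output differed.

```python
import re

# Hypothetical names for illustration; not from the original post.
# The flags are attached to the compiled patterns, so nothing can be
# mistaken for the count argument of sub().
TAG_RE = re.compile(r"<.*?>", re.S)
ENTITY_RE = re.compile(r"&(#[0-9]{1,3}|[a-z]{3,6});", re.I)


def strip_html(source):
    """Remove tags and simple entities from source."""
    source = re.sub(r"[\n\r\f]", " ", source)
    source = TAG_RE.sub("", source)
    source = ENTITY_RE.sub("", source)
    return source


# The flag combination from the question is just an integer, so as the
# fourth positional argument it behaves like count=26.
flags_as_count = int(re.S | re.I | re.M)

sample = "<b>x</b>" * 20  # 40 tags in total

# Equivalent to the original call: only the first 26 tags are removed.
partly_stripped = re.sub("<.*?>", "", sample, count=flags_as_count)

# With the flags in the compiled pattern, every tag is removed in one run.
fully_stripped = strip_html(sample)
```

The embedded-modifier form from the documentation, for example "(?s)<.*?>", should work as well. Later Python versions also accept a flags keyword argument on re.sub, but compiled patterns work on old and new versions alike.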