On Tue, 07 Aug 2007 10:28:24 -0700, Christoph Krammer wrote:

> Hello everybody,
>
> I wanted to use re.sub to strip all HTML tags out of a given string. I
> learned that there are better ways to do this without the re module,
> but I would like to know why my code is not working. I use the
> following:
>
> def stripHtml(source):
>     source = re.sub("[\n\r\f]", " ", source)
>     source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
>     source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
>     return source
>
> But the result still has some tags in it. When I call the second line
> multiple times, all tags disappear, but since HTML tags cannot be
> overlapping, I do not understand this behavior. There is even a
> difference when I omit the re.I (IGNORECASE) option. Without this
> option, some tags containing only capital letters (like </FONT>) were
> kept in the string when doing one processing run but removed when
> doing multiple runs.
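
One thing that stands out in the quoted code, whatever the input looks like: the fourth positional argument of re.sub is count, not flags. So `re.S | re.I | re.M` is most likely being read as "replace at most 26 times" (16 + 2 + 8), and the regex flags are never applied. A minimal sketch of a variant that avoids this by compiling the patterns first; the names strip_html, strip_tags_limited, TAG_RE, ENTITY_RE and NEWLINE_RE are made up for this illustration, and it may not be the whole story for your data:

```python
import re

# Compiled patterns carry their own flags, so a flag value can never
# end up in the count parameter of sub().
NEWLINE_RE = re.compile(r"[\n\r\f]")
TAG_RE = re.compile(r"<.*?>", re.S | re.I | re.M)
ENTITY_RE = re.compile(r"&(#[0-9]{1,3}|[a-z]{3,6});", re.I)


def strip_html(source):
    """Same three steps as the quoted stripHtml, with flags applied as flags."""
    source = NEWLINE_RE.sub(" ", source)
    source = TAG_RE.sub("", source)
    source = ENTITY_RE.sub("", source)
    return source


def strip_tags_limited(source):
    """Mimic the count limit of the quoted call: the flag value acts as count."""
    limit = int(re.S | re.I | re.M)  # 16 + 2 + 8 = 26
    return TAG_RE.sub("", source, count=limit)


if __name__ == "__main__":
    html = "<b>x</b>" * 20  # 40 tags in total
    print(strip_tags_limited(html).count("<"))  # tags left after 26 replacements
    print(strip_html(html))
```

If this is the cause, it would also fit the two observations in the question. Every extra run removes up to 26 more tags, so repeated calls eventually clear them all. Dropping re.I lowers the value from 26 to 24, which moves the cut-off point, so which tags survive depends on their position and not on their case. The entity line has the same problem: re.I alone is 2, so only the first two entities would be removed.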
Can you give some example HTML where it fails?

Ciao,
	Marc 'BlackJack' Rintsch