I assume <tag> is known. For every occurrence of </tag>.*</tag> where .* contains neither <tag> nor </tag>, this removes the first </tag> and the .* that follows it, leaving the final </tag> in place.
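To make that rule concrete, here is a minimal base-R sketch of it using only gsub() with perl = TRUE. It is not the gsubfn approach described next, just one possible way to express the same rule. The names pat and cleaned are made up for this illustration, and the pattern assumes the tag is literally called "tag" and has no attributes.

```r
text <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"

# </tag>, then any run of characters in which no <tag> or </tag> starts,
# then a zero-width lookahead so the following </tag> is not consumed
# and can begin the next match.
pat <- "</tag>(?:(?!</?tag>).)*(?=</tag>)"

cleaned <- gsub(pat, "", text, perl = TRUE)
print(cleaned)
# "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"
```

Here the "does not contain <tag>" test sits inside the pattern itself, as a negative lookahead applied before each character, so no replacement function is needed.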
The regular expression re in the gsubfn code below matches </tag>, then does a non-greedy match (?U) for anything followed by </tag>, but uses a zero-width lookahead subexpression (?=...) for the second </tag> so that it is not consumed and can start the next match.

gsubfn in package gsubfn is like the usual gsub except that instead of replacing the match with a string it passes the match to function f and then replaces the match with the output of f. Here f returns the match unchanged if it contains <tag> and returns the empty string otherwise; backref = 0 means that only the whole match is passed to f. See the gsubfn home page: http://code.google.com/p/gsubfn/ and vignette.

library(gsubfn)

text <- paste("<tag>value1</tag><tag>value2</tag>some",
   "garbage</tag></tag><tag>value3</tag>")

# </tag>, then as little as possible up to (but not including) the next </tag>
re <- "</tag>((?U).*(?=</tag>))"

# keep the match if it contains an opening tag, otherwise drop it
f <- function(x) if (regexpr("<tag>", x) > 0) x else ""

gsubfn(re, f, text, backref = 0, perl = TRUE)

On 2/24/07, Ulrich Keller <[EMAIL PROTECTED]> wrote:
> I am trying to read a number of XML files using xmlTreeParse(). Unfortunately,
> some of them are malformed in a way that makes R crash. The problem is that
> closing tags are sometimes repeated like this:
>
> <tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>
>
> I want to preprocess the contents of the XML file using gsub() before feeding
> them to xmlTreeParse() to clean them up, but I can't figure out how to do it.
> What I need is something that transforms the example above into:
>
> <tag>value1</tag><tag>value2</tag><tag>value3</tag>
>
> Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in
> ".*".
>
> Thanks in advance for you ideas,
>
> Uli
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
