Ulrich Keller <[EMAIL PROTECTED]> writes:

> I am trying to read a number of XML files using xmlTreeParse(). Unfortunately,
> some of them are malformed in a way that makes R crash. The problem is that
> closing tags are sometimes repeated like this:
>
> <tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>
>
> I want to preprocess the contents of the XML file using gsub() before feeding
> them to xmlTreeParse() to clean them up, but I can't figure out how to do it.
> What I need is something that transforms the example above into:
>
> <tag>value1</tag><tag>value2</tag><tag>value3</tag>
>
> Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in
> ".*".
>
> Thanks in advance for your ideas,

Hmm, there are things you just cannot do with RE's, and I suspect that this
is one of them. Something involving explicit splitting of the strings might
work, though. How's this for size?

> trim <- function(x) paste(sub("</tag>.*", "</tag>", x), collapse = "<tag>")
> sapply(strsplit(x, "<tag>"), trim)
[1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
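
For anyone who wants to see what the split-and-trim idea does step by step, here is a minimal sketch. It assumes x is the one-line example from the question, that only a single tag name is involved, and that <tag> elements are never nested. The names clean_tags and pieces are made up for this illustration and are not part of any package.

```r
# Example input from the question, assumed to be one character string.
x <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"

# Hypothetical helper: the same idea as trim() combined with strsplit().
clean_tags <- function(s, open = "<tag>", close = "</tag>") {
  # Split on the opening tag, so every piece starts right after a "<tag>".
  # The first piece is "" because the string begins with "<tag>".
  pieces <- strsplit(s, open, fixed = TRUE)[[1]]
  # In each piece keep the text up to the first closing tag, drop the rest.
  pieces <- sub(paste0(close, ".*"), close, pieces)
  # Put back the opening tags that strsplit() removed.
  paste(pieces, collapse = open)
}

clean_tags(x)
# [1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"
```

Any text that precedes the first <tag> ends up in the first piece and is passed through unchanged unless it contains a </tag> itself, so real files with other elements around the damaged ones would need checking before relying on this.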