Ulrich Keller <[EMAIL PROTECTED]> writes:

> I am trying to read a number of XML files using xmlTreeParse(). Unfortunately,
> some of them are malformed in a way that makes R crash. The problem is that
> closing tags are sometimes repeated like this:
>
> <tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>
>
> I want to preprocess the contents of the XML file using gsub() before feeding
> them to xmlTreeParse() to clean them up, but I can't figure out how to do it.
> What I need is something that transforms the example above into:
>
> <tag>value1</tag><tag>value2</tag><tag>value3</tag>
>
> Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in 
> ".*".
>
> Thanks in advance for your ideas,

Hmm, there are things you just cannot do with RE's, and I suspect that
this is one of them. Something involving explicit splitting of the
strings might work, though. How's this for size?

> x <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"
> trim <- function(x) paste(sub("</tag>.*", "</tag>", x), collapse = "<tag>")
> sapply(strsplit(x, "<tag>"), trim)
[1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"
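
The same split-and-trim idea could be wrapped in a small helper that takes
the tag name as an argument and works on a whole character vector. This is
only a sketch: clean_tags() and its tag argument are made-up names, not part
of any package, and it assumes the tag name contains no regular-expression
metacharacters and has no attributes.

```r
# Hypothetical helper (not from any package): generalizes the
# split-and-trim approach to an arbitrary, attribute-free tag name.
clean_tags <- function(x, tag = "tag") {
  open  <- paste0("<", tag, ">")
  close <- paste0("</", tag, ">")
  # Split each string at every opening tag (literal match, no regex).
  pieces <- strsplit(x, open, fixed = TRUE)
  vapply(pieces, function(p) {
    # Within each piece, drop everything after the first closing tag.
    p <- sub(paste0(close, ".*"), close, p)
    # Glue the pieces back together with the opening tag.
    paste(p, collapse = open)
  }, character(1))
}

x <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"
cleaned <- clean_tags(x)
```

The cleaned string could then be handed to xmlTreeParse(). Note that
anything following a closing tag is discarded up to the next opening tag,
so this is only appropriate when such text really is garbage.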


-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907
