On Sat, 2007-02-24 at 15:03 +0100, Peter Dalgaard wrote:
> Ulrich Keller writes:
> 
> > I am trying to read a number of XML files using xmlTreeParse().
> > Unfortunately, some of them are malformed in a way that makes R crash.
> > The problem is that closing tags are sometimes repeated like this:
> >
> > <tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>
> >
> > I want to preprocess the contents of the XML file using gsub() before
> > feeding them to xmlTreeParse() to clean them up, but I can't figure out
> > how to do it. What I need is something that transforms the example above
> > into:
> >
> > <tag>value1</tag><tag>value2</tag><tag>value3</tag>
> >
> > Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>"
> > in ".*".
> >
> > Thanks in advance for your ideas,
> 
> Hmm, there are things you just cannot do with RE's, and I suspect that
> this is one of them. Something involving explicit splitting of the
> strings might work, though. How's this for size?
> 
> > trim <- function(x) paste(sub("</tag>.*", "</tag>", x), collapse = "<tag>")
> > sapply(strsplit(x, "<tag>"), trim)
> [1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"

Does this work?

> XML
[1] "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"


> gsub("[^>]*(</tag>){2}", "", XML)
[1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"


The pattern matches any run of characters other than '>' that immediately
precedes a "</tag></tag>" sequence, together with that sequence itself, and
replaces the whole match with "".

?
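
For comparison, here is a minimal, self-contained sketch of both approaches
from this thread, run on the example string. The function names
strip_repeated_close() and split_and_trim() are hypothetical and only used
for illustration, and relaxing {2} to {2,} (to cover runs of three or more
closing tags) is an assumption that goes beyond the original question. The
sketch also assumes the tag name contains no regex metacharacters.

```r
# Regex approach: drop any non-'>' run followed by two or more closing tags.
# (Hypothetical helper; {2,} instead of {2} is an assumption.)
strip_repeated_close <- function(x, tag = "tag") {
  close_tag <- paste0("</", tag, ">")
  gsub(paste0("[^>]*(", close_tag, "){2,}"), "", x)
}

# Split approach from the quoted message: split on the opening tag, cut each
# piece after its first closing tag, then glue the pieces back together.
split_and_trim <- function(x, tag = "tag") {
  open_tag  <- paste0("<", tag, ">")
  close_tag <- paste0("</", tag, ">")
  trim <- function(p) paste(sub(paste0(close_tag, ".*"), close_tag, p),
                            collapse = open_tag)
  vapply(strsplit(x, open_tag), trim, character(1))
}

XML <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"

strip_repeated_close(XML)
# [1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"
split_and_trim(XML)
# [1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"

# Caveat: if a value is followed directly by a doubled closing tag, the
# regex approach also swallows the value, while the split approach keeps it.
strip_repeated_close("<tag>value2</tag></tag>")
# [1] "<tag>"
split_and_trim("<tag>value2</tag></tag>")
# [1] "<tag>value2</tag>"
```

So the one-line gsub() is fine as long as the repeated closing tags are always
preceded by a properly closed element plus garbage, as in the example; if that
is not guaranteed, the split-based version looks like the safer choice.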

Marc Schwartz

