Ulrich Keller wrote:
> I am trying to read a number of XML files using xmlTreeParse(). Unfortunately,
> some of them are malformed in a way that makes R crash. The problem is that
> closing tags are sometimes repeated like this:
>
> <tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>
>
> I want to preprocess the contents of the XML file using gsub() before feeding
> them to xmlTreeParse() to clean them up, but I can't figure out how to do it.
> What I need is something that transforms the example above into:
>
> <tag>value1</tag><tag>value2</tag><tag>value3</tag>
>
> Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in
> ".*".
>
> Thanks in advance for your ideas,

Instead of using xmlTreeParse(), which really expects well-formed XML, and
assuming you cannot have the XML generation mechanism fixed, you might try
htmlTreeParse(). While the name suggests it is for HTML, it is really a
"relaxed" XML parser that is capable of handling malformed XML. Malformed
markup is common in HTML, hence the name. Of course, since the XML is
malformed, the results will be hard to predict, as it is hard to make sense
of "non-sense".

If xmlTreeParse() is actually causing R to exit (i.e. what some people refer
to as crashing), then, as Jeff (Horner) said, we would like to be able to stop
this. We will need the actual text/file passed to xmlTreeParse(), version
information for the operating system, R and the XML package, and any locale
information.

However, if by crashing you mean that it generates an error, then that is
expected on malformed XML inputs.

HTH,
 D.

>
> Uli
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Duncan Temple Lang                    [EMAIL PROTECTED]
Department of Statistics              work: (530) 752-4782
4210 Mathematical Sciences Bldg.      fax:  (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
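
For illustration, here is a minimal sketch of the htmlTreeParse() suggestion applied to the fragment from the question. The enclosing <root> element and the variable names are assumptions made for this example, not part of the original files, and since the input is malformed the exact tree the parser recovers may vary with the libxml2 version. The final sessionInfo() call prints the R, operating system, locale and package version details requested above.

```r
# Assumes the XML package is installed, e.g. via install.packages("XML").
library(XML)

# The malformed fragment from the question, wrapped in a hypothetical
# <root> element so that there is a single enclosing node.
txt <- paste0("<root><tag>value1</tag><tag>value2</tag>",
              "some garbage</tag></tag><tag>value3</tag></root>")

# asText = TRUE says `txt` is the content itself rather than a file name.
# useInternalNodes = TRUE returns a document that can be queried with XPath.
doc <- htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Collect the text of every <tag> element, wherever the parser placed it.
values <- xpathSApply(doc, "//tag", xmlValue)
print(values)

# Version and locale information to include in a crash report.
print(sessionInfo())
```

Note that the HTML parser wraps the content in html and body nodes, so it is safer to search with "//tag" than to rely on a fixed path from the top of the document.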
