Hi Roger --

"Bos, Roger" <[EMAIL PROTECTED]> writes:

> I would like to learn how to parse a mixed text/xml document I
> downloaded from the sec.gov website (see example below). I would like

I'm not sure of a more robust way to extract the XML, but from inspection I wrote

> ftp <- "ftp://anonymous:[EMAIL PROTECTED]/edgar/data/1317493/0001144204-08-021221.txt"
> txt <- readLines(ftp)
> xmlInside <- grep("</*XML", txt)
> xmlTxt <- txt[seq(xmlInside[1]+1, xmlInside[2]-1)]

so that xmlTxt contains the part of the message that is XML.

> to parse this to get the value for each xml tag and then access it
> within R, but I don't know much about xml so I don't even know where to

There are several ways to proceed. I personally like the XPath query language. To do this, one might

> library(XML)
> xml <- xmlTreeParse(xmlTxt, useInternalNodes=TRUE)
> head(unlist(xpathApply(xml, "//*", xmlName)))
[1] "ownershipDocument"     "schemaVersion"         "documentType"
[4] "periodOfReport"        "notSubjectToSection16" "issuer"

xpathApply takes an XML document and performs a query. The query '//*' says: find all element nodes with any name (that's the *) that are located anywhere (that's the //) below the current (in this case root) node. This gives a list of nodes; xmlName extracts the name of each node.

If I wanted all the notSubjectToSection16 nodes (sounds ominous), I'd extract them (as a list)

> node <- xpathApply(xml, "//notSubjectToSection16")

and then do something with them, e.g., look at them

> lapply(node, saveXML)
[[1]]
[1] "<notSubjectToSection16>0</notSubjectToSection16>"

(not so bad, looks like nothing is not subject to section 16, that's a relief) and extract their value

> lapply(node, xmlValue)

In one step:

> xpathApply(xml, "//notSubjectToSection16", xmlValue)

?xpathApply is a good starting place, as is http://www.w3.org/TR/xpath, especially http://www.w3.org/TR/xpath#path-abbrev

Martin

> start debugging the errors I am getting in this example code. Can
> anyone help me get started?
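If you want to try the XPath calls without downloading anything, something like the following might serve as a starting point. It is only a sketch: it assumes the XML package is installed, the inline document is a made-up fragment (a few element names are borrowed from the output above, issuerName and all of the values are invented for illustration and do not come from the filing), and the names tagNames, leaves and vals are just placeholders.

```r
library(XML)  # assumes the XML package is installed

# Made-up fragment for illustration only: the values (and the issuerName
# element) are invented, not taken from the actual filing.
xmlTxt <- c(
  "<ownershipDocument>",
  "  <schemaVersion>1</schemaVersion>",
  "  <documentType>4</documentType>",
  "  <notSubjectToSection16>0</notSubjectToSection16>",
  "  <issuer>",
  "    <issuerName>Example Corp</issuerName>",
  "  </issuer>",
  "</ownershipDocument>"
)

# Parse the text into an internal document so XPath queries can be used.
doc <- xmlTreeParse(paste(xmlTxt, collapse = "\n"),
                    asText = TRUE, useInternalNodes = TRUE)

# Names of every element anywhere in the document.
tagNames <- unlist(xpathApply(doc, "//*", xmlName))

# Elements with no child elements, i.e. the ones that carry a plain value.
leaves <- getNodeSet(doc, "//*[not(*)]")

# Named character vector: element name -> text value.
vals <- setNames(sapply(leaves, xmlValue), sapply(leaves, xmlName))

print(tagNames)
print(vals)
```

After this, vals[["notSubjectToSection16"]] gives the value as a character string, which you can convert with as.numeric() or similar. Restricting to leaf elements is just one choice; a container such as issuer would otherwise return the pasted-together text of its children.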
>
> Thanks, Roger
>
> ftp <- "ftp://anonymous:[EMAIL PROTECTED]/edgar/data/1317493/0001144204-08-021221.txt"
> download.file(url=ftp, destfile="test2.txt")
> xmlTreeParse("test2.txt")

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.