Thanks, that was very helpful. I am using readLines and grep. If grep isn't powerful enough, I might end up using the XML package, but I hope that won't be necessary.
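A minimal sketch of that approach, assuming the links always look like "/en/Ships/Name-1234567.html" and that each <a> tag sits on its own line (the URL below is just the one from the thread, not the real target):

# read the page source line by line
h <- readLines("http://www.omegahat.org/RCurl")

# keep only the lines that contain a ship-style href
ship_lines <- grep("/en/Ships/.*\\.html", h, value = TRUE)

# pull out exactly the 7 digits that precede ".html"
# (relies on the <a> tag being alone on its line, as described below)
ids <- sub(".*-([0-9]{7})\\.html.*", "\\1", ship_lines)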
Thanks again,
KW

--

On May 14, 2012, at 7:18 PM, J Toll wrote:

> On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub <kw1...@gmail.com> wrote:
>> Folks,
>> I want to scrape a series of web-page sources for strings like the
>> following:
>>
>> "/en/Ships/A-8605507.html"
>> "/en/Ships/Aalborg-8122830.html"
>>
>> which appear in an href inside an <a> tag inside a <div> tag inside a table.
>>
>> In fact, all I want is the (exactly) 7-digit number before ".html".
>>
>> The good news is that, as far as I can tell, the <a> tag is always on its
>> own line, so some kind of line-by-line grep should suffice once I figure
>> out the following:
>>
>> What is the best package/command to use to get the source of a web page?
>> I tried using something like:
>>
>> if (url.exists("http://www.omegahat.org/RCurl")) {
>>   h <- basicTextGatherer()
>>   curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
>>   # Now read the text that was accumulated during the query response.
>>   h$value()
>> }
>>
>> which works, except that I get one long streamed HTML doc without the line
>> breaks.
>
> You could use:
>
> h <- readLines("http://www.omegahat.org/RCurl")
>
> -- or --
>
> download.file(url = "http://www.omegahat.org/RCurl", destfile = "tmp.html")
> h <- scan("tmp.html", what = "", sep = "\n")
>
> and then use grep or the XML package for processing.
>
> HTH
>
> James
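For completeness, a minimal sketch of the XML-package route mentioned above (it assumes the XML package is installed and that htmlParse can fetch the page directly; the XPath only mirrors the table/div/a nesting described in the thread and is illustrative):

library(XML)

# parse the page and pull the href attribute of every <a> nested in a
# <div> inside a table
doc   <- htmlParse("http://www.omegahat.org/RCurl")
hrefs <- xpathSApply(doc, "//table//div//a/@href")

# keep the ship links and extract the 7-digit number before ".html"
ship <- hrefs[grepl("/en/Ships/.*\\.html$", hrefs)]
ids  <- sub(".*-([0-9]{7})\\.html$", "\\1", ship)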