Thanks,
  That was very helpful.

I am using readLines and grep. If grep isn't powerful enough I might end up 
using the XML package but I hope that won't be necessary.
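
For reference, a minimal sketch of that readLines-plus-grep approach (the sample lines below are stand-ins for the real page source, which isn't shown in the thread):

```r
# Stand-in for readLines(<the real ships page URL>)
lines <- c('<a href="/en/Ships/A-8605507.html">A</a>',
           '<a href="/en/Ships/Aalborg-8122830.html">Aalborg</a>',
           '<p>no ship link here</p>')

# Keep only the lines containing ship hrefs, then pull out the
# exactly-7-digit number that precedes ".html".
hits <- grep('/en/Ships/', lines, value = TRUE)
ids  <- regmatches(hits, regexpr('[0-9]{7}(?=\\.html)', hits, perl = TRUE))
ids  # "8605507" "8122830"
```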

Thanks again,
KW

--

On May 14, 2012, at 7:18 PM, J Toll wrote:

> On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub <kw1...@gmail.com> wrote:
>> Folks,
>>  I want to scrape a series of web-page sources for strings like the 
>> following:
>> 
>> "/en/Ships/A-8605507.html"
>> "/en/Ships/Aalborg-8122830.html"
>> 
>> which appear in an href inside an <a> tag inside a <div> tag inside a table.
>> 
>> In fact all I want is the (exactly) 7-digit number before ".html".
>> 
>> The good news is that, as far as I can tell, the <a> tag is always on its 
>> own line, so some kind of line-by-line grep should suffice once I figure out 
>> the following:
>> 
>> What is the best package/command to use to get the source of a web page. I 
>> tried using something like:
>> if(url.exists("http://www.omegahat.org/RCurl")) {
>>  h = basicTextGatherer()
>>  curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
>>   # Now read the text that was cumulated during the query response.
>>  h$value()
>> }
>> 
>> which works, except that I get one long streamed HTML doc without the line 
>> breaks.
> 
> You could use:
> 
> h <- readLines("http://www.omegahat.org/RCurl")
> 
> -- or --
> 
> download.file(url = "http://www.omegahat.org/RCurl", destfile = "tmp.html")
> h = scan("tmp.html", what = "", sep = "\n")
> 
> and then use grep or the XML package for processing.
> 
> HTH
> 
> James


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.