Thanks Gabor,

Nifty regexp. I had never used strapplyc before, and I am sure it will become a nice addition to my toolkit.
KW

Message: 5
Date: Tue, 15 May 2012 07:55:33 -0400
From: Gabor Grothendieck <ggrothendi...@gmail.com>
To: Keith Weintraub <kw1...@gmail.com>
Cc: r-help@r-project.org
Subject: Re: [R] Scraping a web page.
Message-ID: <CAP01uR=zdxHocxpsZdpT+4Kx2=L2vr9jnr=i=_Qhs39O=qo...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Tue, May 15, 2012 at 7:06 AM, Keith Weintraub <kw1...@gmail.com> wrote:
>
> Thanks,
> That was very helpful.
>
> I am using readLines and grep. If grep isn't powerful enough I might end up
> using the XML package but I hope that won't be necessary.
>

This only uses readLines and strapplyc (from gsubfn). It scrapes the
relevant strings from your post on nabble, and by modifying URL and pat
you can likely get it to work with whatever the format of your original
files is:

library(gsubfn)
URL <- "http://r.789695.n4.nabble.com/Scraping-a-web-page-tp4630005.html"
L <- readLines(URL)
pat <- '<br/>"/en/Ships.*-(\\d{7}).html"'
strapplyc(L, pat, simplify = c)

The result from the last line is:

[1] "8605507" "8122830"

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
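
For readers who want to try the same capture-group extraction without gsubfn or a network connection, here is a minimal base-R sketch. It is not the code from the thread: the sample lines (including the ship names) are hypothetical stand-ins for what readLines(URL) might return, and the pattern is a slight variant that escapes the dot before "html" and uses [0-9] in place of \d.

```r
# Hypothetical sample lines standing in for readLines(URL);
# the ship names are made up for illustration.
L <- c(
  '<br/>"/en/Ships/EXAMPLE-ONE-8605507.html"',
  'some unrelated line',
  '<br/>"/en/Ships/EXAMPLE-TWO-8122830.html"'
)

# Variant of the pattern from the thread: one capture group for the
# seven-digit number, with the dot before "html" escaped.
pat <- '<br/>"/en/Ships.*-([0-9]{7})\\.html"'

# regexec() records the full match and each capture group per line;
# regmatches() returns character(0) for lines that do not match.
m <- regmatches(L, regexec(pat, L))

# Keep only the matching lines and take element 2 (the first capture group).
hits <- m[lengths(m) > 0]
ids <- unname(vapply(hits, function(x) x[2], character(1)))
print(ids)
```

With these sample lines, ids should hold the two seven-digit numbers, which is roughly what strapplyc(L, pat, simplify = c) does in one call.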