David Morris wrote:
> Is there a tutorial yet on how to scrape a site with multiple pages?
> I would like to scrape a site like allrecipes.com
> <http://allrecipes.com> or epicurious.com <http://epicurious.com>, and
> I can't figure out what to do to scrape multiple pages.

In the script templates under "Insert" on the left-hand panel, there is a
menu item called "Code to Scrape several pages".
The main point is this function:

    piggybank.scrapeURL(url, scrapePage, failure);

What you need to do is:

1) Collect the URLs of the pages you want to scrape from the current
   page; utilities.collectURLsWithSubstring(doc, substring) may be
   helpful.
2) Iterate over your array of URLs, running each of them through
   piggybank.scrapeURL(url, scrapePage, failure);
3) In your scrapePage function, put the code to scrape your data, just
   as you would for a single page. (See the sketch after the template
   for one way to fill in the stubs.)

HTH
Keith

const rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
const dc  = "http://purl.org/dc/elements/1.1/";
const loc = "http://simile.mit.edu/2005/05/ontologies/location#";
// add other useful namespace prefixes here

// This function scrapes a single page; by default it just
// records the page's title as a dc:title statement.
var scrapePage = function(document) {
    var uri = document.location.href;
    data.addStatement(uri, dc + "title", document.title, true);
};

// This function should return an array of URLs (strings)
// of other pages to scrape (e.g., subsequent search results
// pages). It should not include the URL of the current page.
var gatherPagesToScrape = function(document) {
    return [];
};

// This function is called if there is a failure in any of
// the subscraping invocations.
var failure = function(e) {
    alert("Error occurred: " + e);
};

// =========================================================

// first scrape the current page
scrapePage(document);

// then gather the next pages to scrape
var urls = gatherPagesToScrape(document);

// and tell Piggy Bank to scrape them (and what function should do it)
for (var i = 0; i < urls.length; i++) {
    piggybank.scrapeURL(urls[i], scrapePage, failure);
}
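For a site like allrecipes.com, a filled-in gatherPagesToScrape might
look like the sketch below. This is only an illustrative guess, not
tested against the live site: the "page=" substring is a hypothetical
placeholder for whatever the site's next-page links actually contain,
and the only Piggy Bank calls used are the ones already shown above.

// A minimal sketch of a filled-in gatherPagesToScrape, assuming the
// target site's pagination links contain a substring like "page=".
// That substring is hypothetical -- inspect the site's next-page
// links and substitute whatever they actually contain.
var gatherPagesToScrape = function(document) {
    var current = document.location.href;
    var candidates = utilities.collectURLsWithSubstring(document, "page=");
    var urls = [];
    for (var i = 0; i < candidates.length; i++) {
        // Per the template's contract, exclude the current page itself.
        if (candidates[i] != current) {
            urls.push(candidates[i]);
        }
    }
    return urls;
};

The filter against the current URL matters because the template already
scrapes the current page before looping, so returning it again would
scrape it twice.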
