David Morris wrote:
> Is there a tutorial yet on how to scrape a site with multiple pages?
> I would like to scrape a site like allrecipes.com
> <http://allrecipes.com> or epicurious.com <http://epicurious.com>, and
> I can't figure out what to do to scrape multiple pages.

In the script templates under "Insert" on the left-hand panel, there is a
menu item called "Code to Scrape several pages".
The main point is this function:

    piggybank.scrapeURL(url, scrapePage, failure);

What you need to do is:

1) Collect the URLs of the pages you want to scrape from the current
   page; utilities.collectURLsWithSubstring(doc, substring) may be
   helpful.
2) Iterate over your array of URLs, running each of them through
   piggybank.scrapeURL(url, scrapePage, failure);
3) In your scrapePage function, put the code to scrape your data, just
   as you would for a single page. (See the sketch after the template
   for one way to fill in the stubs.)

HTH
Keith

const rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
const dc  = "http://purl.org/dc/elements/1.1/";
const loc = "http://simile.mit.edu/2005/05/ontologies/location#";
// add other useful namespace prefixes here

// This function scrapes a single page; by default it just
// records the page's title as a dc:title statement.
var scrapePage = function(document) {
    var uri = document.location.href;
    data.addStatement(uri, dc + "title", document.title, true);
};

// This function should return an array of URLs (strings)
// of other pages to scrape (e.g., subsequent search results
// pages). It should not include the URL of the current page.
var gatherPagesToScrape = function(document) {
    return [];
};

// This function is called if there is a failure in any of
// the subscraping invocations.
var failure = function(e) {
    alert("Error occurred: " + e);
};

// =========================================================

// first scrape the current page
scrapePage(document);

// then gather the next pages to scrape
var urls = gatherPagesToScrape(document);

// and tell Piggy Bank to scrape them (and what function should do it)
for (var i = 0; i < urls.length; i++) {
    piggybank.scrapeURL(urls[i], scrapePage, failure);
}
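For a site like allrecipes.com, a filled-in gatherPagesToScrape might
look like the sketch below. This is only an illustrative guess, not
tested against the live site: the "page=" substring is a hypothetical
placeholder for whatever the site's next-page links actually contain,
and the only Piggy Bank calls used are the ones already shown above.

// A minimal sketch of a filled-in gatherPagesToScrape, assuming the
// target site's pagination links contain a substring like "page=".
// That substring is hypothetical -- inspect the site's next-page
// links and substitute whatever they actually contain.
var gatherPagesToScrape = function(document) {
    var current = document.location.href;
    var candidates = utilities.collectURLsWithSubstring(document, "page=");
    var urls = [];
    for (var i = 0; i < candidates.length; i++) {
        // Per the template's contract, exclude the current page itself.
        if (candidates[i] != current) {
            urls.push(candidates[i]);
        }
    }
    return urls;
};

The filter against the current URL matters because the template already
scrapes the current page before looping, so returning it again would
scrape it twice.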
