Re: Depth-first scraping approach

Ryan Lee Wed, 16 Jan 2008 10:25:09 -0800

Roberto García wrote:
> The problem appears when I'm scraping multiple results pages. It
> is due to the fact that when I use "piggybank.scrapeURL", URLs are
> queued, thus implementing a breadth-first search. The result is that,
> due to problems in the site, earlier multi-page result sets are
> masked by the ones that follow.


Perhaps you could make a first pass through the breadth-first queue 
purely to requeue pages?  That is, scrape year 1, page 1 solely to 
discover that it has three pages, then requeue all three for later, 
detailed scraping.
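
As a rough illustration of that two-pass idea, here is a minimal sketch in Python rather than in an actual Piggy Bank scraper. Everything in it is an assumption made for the example: SITE, fetch, discover and detail are hypothetical names, the page data is made up, and a plain FIFO deque stands in for whatever queue sits behind piggybank.scrapeURL. It only shows the ordering: every first page is visited once to learn the page count, and the detailed scrapes are queued behind those visits.

```python
from collections import deque

# Hypothetical stand-in for the site: (year, page) -> (total pages, records).
SITE = {
    (2006, 1): (2, ["a"]),
    (2006, 2): (2, ["b"]),
    (2007, 1): (3, ["c"]),
    (2007, 2): (3, ["d"]),
    (2007, 3): (3, ["e"]),
}


def fetch(year, page):
    """Pretend to download one results page (hypothetical helper)."""
    return SITE[(year, page)]


def scrape(years):
    queue = deque()  # FIFO, so entries are handled breadth-first
    results = []

    def discover(year, page):
        # Pass 1: read only the page count, then requeue every page.
        total_pages, _ = fetch(year, page)
        for p in range(1, total_pages + 1):
            queue.append((detail, year, p))

    def detail(year, page):
        # Pass 2: the detailed scrape of a single results page.
        _, records = fetch(year, page)
        results.append((year, page, records))

    # Seed the queue with the first page of each year.
    for year in years:
        queue.append((discover, year, 1))

    while queue:
        handler, year, page = queue.popleft()
        handler(year, page)
    return results
```

Calling scrape([2006, 2007]) fetches page 1 of each year twice, once to count pages and once for the detailed pass, which is where the extra time goes.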

Clearly it's going to take longer, since each first page gets fetched 
twice, and it may even be a terrible suggestion if you've got a few 
decades to work through.

Let us know how it goes.

-- 
Ryan Lee                  [EMAIL PROTECTED]
MIT CSAIL Research Staff  http://simile.mit.edu/
http://people.csail.mit.edu/ryanlee/
