Dear all, I'm writing a scraper with Solvent. It's a really nice tool. However, I'm facing some problems due to the messy site I'm scraping: a research-results management webapp used by some universities in Spain.
The problem appears when I scrape multiple results pages, and it comes from the fact that "piggybank.scrapeURL" queues URLs, effectively implementing a breadth-first search. The result is that, due to problems in the site, earlier multiple-results pages are masked by the ones that follow.

Basically, the web application has a page for each research group, then a page for each year listing all the publications produced that year, and a page for each publication with its details. When I scrape the first page (all years), I get the links to each year page, and Solvent queues them. Then each one of them is processed. However, there may be more than one page per year, and those follow-up URLs are queued at the end of the list of URLs to scrape. Consequently, all the year pages are scraped first, but by the time the follow-up pages are reached they are lost, because the web application does not generate unique URLs for the follow-up pages of each year.

The structure looks like this:

  all years --> year (1 or more pages) --> publication details

and the resulting queue order is:

  year 1
  year 2
  year 3
  year 4
  ...
  year 1, page 2
  year 1, page 3
  year 2, page 1
  ...

Therefore, is it possible to make Solvent scrape the URLs as they are encountered, thus implementing a depth-first search?

Thank you for your attention.

Best

--
Roberto García
http://rhizomik.net/~roberto

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
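To make the queuing issue concrete, here is a minimal, self-contained sketch of the two traversal orders, independent of Solvent/Piggy Bank internals. The site layout, URL names, and the `crawl` helper are all hypothetical, chosen only to mirror the group --> year --> follow-up-page structure described above; the real `piggybank.scrapeURL` API is not modeled here.

```python
from collections import deque


def crawl(start, links, depth_first=False):
    """Traverse a link graph; `links` maps a URL to the URLs found on it.

    depth_first=False: newly found URLs go to the back of a queue and are
    processed last (breadth-first, the queuing behaviour described above).
    depth_first=True: the most recently found URL is processed next (stack),
    so a year's follow-up pages are scraped right after that year's page.
    """
    frontier = deque([start])
    order = []          # the order in which pages get scraped
    seen = {start}
    while frontier:
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        children = links.get(url, [])
        # Reverse for DFS so children are still visited left-to-right.
        for child in (reversed(children) if depth_first else children):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


# Hypothetical site mirroring the structure in the message:
site = {
    "all-years": ["year-1", "year-2"],
    "year-1": ["year-1-p2"],   # follow-up results page for year 1
    "year-2": [],
    "year-1-p2": [],
}

# Breadth-first: all year pages first, follow-up pages only at the end.
print(crawl("all-years", site))
# Depth-first: year 1's follow-up page is scraped before moving to year 2.
print(crawl("all-years", site, depth_first=True))
```

With the breadth-first order, "year-1-p2" is reached only after every year page, which is exactly when the non-unique follow-up URLs get masked; the depth-first order visits it immediately after "year-1".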
