Dear all, I'm writing a scraper with Solvent. It's a really nice tool. However, I'm facing some problems due to the messy site I'm scraping: a research-results management webapp used by some universities in Spain.
The problem appears when I scrape multiple results pages, and it comes from the fact that "piggybank.scrapeURL" queues URLs, effectively implementing a breadth-first search. The result is that, due to problems in the site, earlier multiple-results pages are masked by the ones that follow.

Basically, the web application has a page for each research group, then a page for each year listing all the publications produced that year, and a page for each publication with its details. When I scrape the first page (all years), I get the links to each year page, and Solvent queues them. Then each one of them is processed. However, there may be more than one page per year, and those follow-up URLs are queued at the end of the list of URLs to scrape. Consequently, all the year pages are scraped first, but by the time the follow-up pages are reached they are lost, because the web application does not generate unique URLs for the follow-up pages of each year.

The structure looks like this:

  all years --> year (1 or more pages) --> publication details

and the resulting queue order is:

  year 1
  year 2
  year 3
  year 4
  ...
  year 1, page 2
  year 1, page 3
  year 2, page 1
  ...

Therefore, is it possible to make Solvent scrape the URLs as they are encountered, thus implementing a depth-first search?

Thank you for your attention.

Best

--
Roberto García
http://rhizomik.net/~roberto

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
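To make the queuing issue concrete, here is a minimal, self-contained sketch of the two traversal orders, independent of Solvent/Piggy Bank internals. The site layout, URL names, and the `crawl` helper are all hypothetical, chosen only to mirror the group --> year --> follow-up-page structure described above; the real `piggybank.scrapeURL` API is not modeled here.

```python
from collections import deque


def crawl(start, links, depth_first=False):
    """Traverse a link graph; `links` maps a URL to the URLs found on it.

    depth_first=False: newly found URLs go to the back of a queue and are
    processed last (breadth-first, the queuing behaviour described above).
    depth_first=True: the most recently found URL is processed next (stack),
    so a year's follow-up pages are scraped right after that year's page.
    """
    frontier = deque([start])
    order = []          # the order in which pages get scraped
    seen = {start}
    while frontier:
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        children = links.get(url, [])
        # Reverse for DFS so children are still visited left-to-right.
        for child in (reversed(children) if depth_first else children):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


# Hypothetical site mirroring the structure in the message:
site = {
    "all-years": ["year-1", "year-2"],
    "year-1": ["year-1-p2"],   # follow-up results page for year 1
    "year-2": [],
    "year-1-p2": [],
}

# Breadth-first: all year pages first, follow-up pages only at the end.
print(crawl("all-years", site))
# Depth-first: year 1's follow-up page is scraped before moving to year 2.
print(crawl("all-years", site, depth_first=True))
```

With the breadth-first order, "year-1-p2" is reached only after every year page, which is exactly when the non-unique follow-up URLs get masked; the depth-first order visits it immediately after "year-1".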
