Drew,

Take a look at the start_requests() method on Scrapy's Spider class. Override that method and yield a Request object for each page you want to scrape. Ref: http://doc.scrapy.org/en/latest/topics/spiders.html?highlight=make_request#scrapy.spider.Spider.make_requests_from_url
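Roughly, the override looks something like this. It's just a minimal sketch using the plain scrapy.Spider base class; the spider name, the hard-coded pin list, and the URL pattern are placeholders taken from your example, not anything Scrapy requires:

    import scrapy

    class PinSpider(scrapy.Spider):
        name = 'pins'

        def start_requests(self):
            # Yield one Request per pin; Scrapy pulls from this generator
            # as the scheduler has room, so nothing is built up front.
            for pin in ['000000001', '000000002', '000000003']:
                yield scrapy.Request('http://www.foo.com/%s' % pin,
                                     callback=self.parse)

        def parse(self, response):
            # ... your extraction logic goes here ...
            pass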
I like to use start_requests() when I'm pulling from a database, because you can write the function as a generator and only pull from the DB when you need to. (I usually also mark the status as "QUEUED" in my DB once a URL has been handed to Scrapy, and this is a good place to put that logic.)

One gotcha I've run into with this: if you query Mongo and hold a cursor pointing to your results, that cursor will time out much sooner than I expected. I implemented start_requests() as a generator, as described above, but the cursor kept timing out between retrievals of the URLs. You can check whether the cursor has timed out and re-acquire the result set in start_requests(), or you can move to a queuing data structure, which is what I tend to prefer. (There's a rough sketch of the batched re-query approach at the end of this mail.)

Hope this helps. If you get stuck with start_requests(), feel free to send me a link to a pastebin and I'll check it out when I have time.

Thanks,
Travis

On Thu, Sep 25, 2014 at 7:45 AM, Nicolás Alejandro Ramírez Quiros <nramirez...@gmail.com> wrote:

> If you already have the "pins" you want to crawl, just put them in a file, then crawl the site. When the spider stops, calculate the difference between the spider's output and your full list, and launch the spider again with that; repeat as many times as needed.
>
> On Thursday, September 25, 2014 at 11:12:04 UTC-3, Drew Friestedt wrote:
>
>> I'm trying to set up a scrape that targets 1M unique URLs on the same site. The scrape uses a proxy and a captcha breaker, so it's running pretty slowly, and it's prone to crashing because the target site goes down frequently (not because of my scraping). Once the 1M pages are scraped, the scrape will grab about 1,000 incremental URLs per day.
>>
>> URL format:
>> http://www.foo.com/000000001  # the number sequence is a 'pin'
>> http://www.foo.com/000000002
>> http://www.foo.com/000000003
>> etc.
>>
>> Does my proposed setup make sense?
>>
>> Set up MongoDB with 1M pins and a scraped flag. For example:
>> {'pin': '000000001', 'scraped': False}
>>
>> In the scrape I would set up a query to select 10,000 pins where 'scraped' = False, then append those 10,000 URLs to start_urls[]. The resulting scrape would get inserted into another collection, and each pin's 'scraped' flag would get set to True. After those 10,000 pins are scraped I would run the scrape again, until all 1M pins are scraped.
>>
>> Does this setup make sense, or is there a more efficient way to do this?
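P.S. Here's a rough sketch of the batched re-query approach I mentioned, wired up to a pin collection like the one you described. A few assumptions: I'm using pymongo (update_one() is the pymongo 3 name; on pymongo 2.x it's update()), and I've swapped your boolean 'scraped' flag for a 'status' field (NEW / QUEUED / DONE) so the batch query never hands the same pin to Scrapy twice. The db, collection, and field names are placeholders for whatever you actually set up.

    import pymongo
    import scrapy

    class PinSpider(scrapy.Spider):
        name = 'pins'
        batch_size = 1000  # small enough that a batch is handed off long before a cursor could idle out

        def __init__(self, *args, **kwargs):
            super(PinSpider, self).__init__(*args, **kwargs)
            # 'foo' / 'pins' are placeholder db and collection names
            self.pins = pymongo.MongoClient()['foo']['pins']

        def start_requests(self):
            while True:
                # Re-query in small, fully materialized batches instead of
                # holding one cursor open for the life of the crawl.
                batch = list(self.pins.find({'status': 'NEW'}).limit(self.batch_size))
                if not batch:
                    break
                for doc in batch:
                    # Mark the pin as handed to Scrapy so the next batch
                    # query doesn't pick it up again.
                    self.pins.update_one({'_id': doc['_id']},
                                         {'$set': {'status': 'QUEUED'}})
                    yield scrapy.Request('http://www.foo.com/%s' % doc['pin'],
                                         callback=self.parse,
                                         meta={'pin': doc['pin']})

        def parse(self, response):
            # ... extract and store your item, then mark the pin done so a
            # restarted crawl skips it ...
            self.pins.update_one({'pin': response.meta['pin']},
                                 {'$set': {'status': 'DONE'}})

One thing to keep in mind: if the spider dies mid-run, pins left in QUEUED were handed to Scrapy but never finished, so reset them to NEW before restarting.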