Drew,

Take a look at the start_requests() method on Scrapy's Spider class.
You override this method and yield the Request object for the next page
to scrape.  Ref:
http://doc.scrapy.org/en/latest/topics/spiders.html?highlight=make_request#scrapy.spider.Spider.make_requests_from_url

I like to use start_requests() when I'm pulling from a database, because
you can write the function as a generator and only pull from the DB when you
need to.  (I usually also mark a record's status as "QUEUED" in my DB once
it's been handed to Scrapy, and this is a good place to put that logic.)
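
A rough sketch of what I mean (the database, collection, and field names
here are just placeholders, not anything from your actual setup):

    import pymongo
    import scrapy

    class PinSpider(scrapy.Spider):
        name = "pins"

        def start_requests(self):
            # Placeholder DB/collection/field names -- adjust to your schema.
            pins = pymongo.MongoClient()["scraper"]["pins"]
            # Because this is a generator, Scrapy only asks for the next
            # request when it has capacity, so we never load all of the
            # pending documents into memory at once.
            for doc in pins.find({"status": "PENDING"}):
                pins.update_one({"_id": doc["_id"]},
                                {"$set": {"status": "QUEUED"}})
                yield scrapy.Request("http://www.foo.com/" + doc["pin"],
                                     callback=self.parse)

        def parse(self, response):
            # Extract your items here, then mark the pin as scraped
            # (e.g. flip its status to "DONE") from a pipeline or callback.
            pass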

One gotcha with this that I've run into: if you query Mongo and hold a
cursor pointing to your results, that cursor will time out much more quickly
than I expected.  I implemented start_requests() as a generator, as described
above, but the cursor would time out between retrieving one URL and the
next!  (You can check whether the cursor has timed out and re-acquire the
result set in start_requests(), or you can move to a queuing data structure,
which is what I tend to prefer.)
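
What has worked for me is sketched roughly below (same placeholder names, as
a drop-in variant of the start_requests() above): snapshot the _ids in one
quick pass so no long-lived cursor is left open.  Newer pymongo versions also
let you pass no_cursor_timeout=True to find(), but then you have to remember
to close the cursor yourself.

    def start_requests(self):
        pins = pymongo.MongoClient()["scraper"]["pins"]
        # Grab just the pending _ids in one quick pass; that cursor is
        # exhausted immediately, and fetching each document individually
        # afterwards means nothing is left open to time out while Scrapy
        # slowly works through the requests.
        pending_ids = [d["_id"] for d in
                       pins.find({"status": "PENDING"}, {"_id": 1})]
        for _id in pending_ids:
            doc = pins.find_one({"_id": _id})
            if doc is None or doc.get("status") != "PENDING":
                continue  # skip anything that changed since the snapshot
            pins.update_one({"_id": _id}, {"$set": {"status": "QUEUED"}})
            yield scrapy.Request("http://www.foo.com/" + doc["pin"],
                                 callback=self.parse)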

Hope this helps.  If you get stuck with start_requests(), feel free to send
me a link to a pastebin and I'll check it out when I have time.

Thanks,
Travis



On Thu, Sep 25, 2014 at 7:45 AM, Nicolás Alejandro Ramírez Quiros <
nramirez...@gmail.com> wrote:

> If you already have the "pins" you want to crawl, just put them in a file,
> then crawl the site. When the spider stops, you calculate the difference
> between the spider's output and your total, and you launch the spider with
> that; you will have to repeat this as many times as needed.
>
> On Thursday, September 25, 2014 11:12:04 UTC-3, Drew Friestedt
> wrote:
>
>> I'm trying to set up a scrape that targets 1M unique URLs on the same
>> site.  The scrape uses a proxy and a CAPTCHA breaker, so it's running
>> pretty slowly, and it's prone to crashing because the target site goes
>> down frequently (not from my scraping).  Once the 1M pages are scraped,
>> the scrape will grab about 1,000 incremental URLs per day.
>>
>> URL Format:
>> http://www.foo.com/000000001 #the number sequence is a 'pin'
>> http://www.foo.com/000000002
>> http://www.foo.com/000000003
>> etc..
>>
>> Does my proposed setup make sense?
>>
>> Set up MongoDB with 1M pins and a scraped flag.  For example:
>> {'pin': '000000001', 'scraped': False}
>>
>> In the scrape I would set up a query to select 10,000 pins where 'scraped'
>> = False.  I would then append those 10,000 URLs to start_urls.  The
>> resulting scrape would get inserted into another collection, and the pin's
>> 'scraped' flag would get set to True.  After the 10,000 pins are scraped,
>> I would run the scrape again, repeating until all 1M pins are scraped.
>>
>> Does this setup make sense or is there a more efficient way to do this?
>>
