Hi everyone,
I am writing a Scrapy spider that will crawl about 1000 domains, and I am wondering whether there is a way to track the number of domains crawled, because crawling 1000 domains in a single process will take a long time. If I could track how many domains have been processed, I could trigger a task such as sending an email after every 100 of the 1000 domains are done.
I searched the internet but could not find anything relevant. If anyone knows a way, please tell me. If there is no good way, I will fall back to tracking the number of URLs crawled, but it would be better to track the number of domains. One idea I am considering is sketched after my spider code below.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'alok2'

    # 'list.txt' contains the domains I have to crawl, one per line
    allowed_domains = [line.strip() for line in open('list.txt')]
    start_urls = ['http://' + domain for domain in allowed_domains]

    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # this is to keep track of domains whose links have all been crawled
        self.count = 0

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        #lines
        #lines
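
One idea I am considering (not sure if this is the proper Scrapy way) is to record the domain of every response inside parse_item and fire my task whenever the count of distinct domains crosses another multiple of 100. A rough sketch under my own assumptions: seen_domains, notified_at and send_progress_mail are names I made up, urlparse is just used to pull the host out of response.url, and the actual mail sending (e.g. scrapy.mail.MailSender or plain smtplib) is left as a placeholder. Note this counts a domain as soon as its first page comes back, not when all of its links are finished, so it is only an approximation of what I really want.

from urlparse import urlparse  # Python 2; use urllib.parse on Python 3

class MySpider(CrawlSpider):
    # ... same name / allowed_domains / start_urls / rules as above ...

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.seen_domains = set()   # distinct domains that returned at least one response
        self.notified_at = 0        # count at which the last notification was sent

    def parse_item(self, response):
        domain = urlparse(response.url).netloc
        if domain not in self.seen_domains:
            self.seen_domains.add(domain)
            # every time another 100 distinct domains have been seen, trigger the task
            if len(self.seen_domains) - self.notified_at >= 100:
                self.notified_at = len(self.seen_domains)
                self.send_progress_mail(self.notified_at)  # hypothetical helper
        #lines (normal item extraction continues here)

    def send_progress_mail(self, n):
        # placeholder: plug in scrapy.mail.MailSender or smtplib here
        self.log('crawled responses from %d distinct domains so far' % n)

Would something like this be reasonable, or is there a cleaner way (signals, stats, a middleware) to know when a whole domain has actually finished?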