Hi! I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, follow their internal links, and scrape the contents of any external links (links whose domain differs from the original site's domain).
I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites I run into a problem, because I don't know which start_url I'm currently on, so I can't adjust the rules appropriately.

Here's what I came up with so far. It works for a single website, but I'm not sure how to apply it to a list of websites:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HomepagesSpider(CrawlSpider):
    name = 'homepages'

    homepage = 'http://www.somesite.com'
    start_urls = [homepage]

    # strip the scheme and www, and any trailing slash
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    rules = (
        Rule(LinkExtractor(allow_domains=(domain,)),
             callback='parse_internal', follow=True),
        Rule(LinkExtractor(deny_domains=(domain,)),
             callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass

This could probably be done by passing the start_url as an argument when calling the scraper, but I'm looking for a way to do it programmatically within the spider itself. Any ideas?

Thanks!
Simon.
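P.S. To clarify what I mean by the argument approach: something like the rough sketch below, where the homepage is passed via "scrapy crawl homepages -a homepage=..." and the rules are built per run in __init__. The urlparse-based domain handling is just my assumption instead of the string replaces above, and I believe CrawlSpider compiles its rules during __init__, so setting self.rules before calling the parent __init__ should work, though I haven't tested it across versions. This still means launching the spider once per site, which is exactly what I'd like to avoid.

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HomepagesSpider(CrawlSpider):
    name = 'homepages'

    def __init__(self, homepage='http://www.somesite.com', *args, **kwargs):
        # Derive the bare domain from whatever homepage was passed in,
        # e.g. 'http://www.somesite.com' -> 'somesite.com' (my assumption:
        # using urlparse instead of the string replaces above).
        domain = urlparse(homepage).netloc.replace('www.', '')

        self.start_urls = [homepage]

        # Build the rules for this particular domain before CrawlSpider
        # compiles them in its own __init__.
        self.rules = (
            Rule(LinkExtractor(allow_domains=(domain,)),
                 callback='parse_internal', follow=True),
            Rule(LinkExtractor(deny_domains=(domain,)),
                 callback='parse_external', follow=False),
        )
        super(HomepagesSpider, self).__init__(*args, **kwargs)

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass

Run as, for example: scrapy crawl homepages -a homepage=http://www.somesite.com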