Hi, if I understood your problem correctly, a simple approach would be to accept the site `homepage` as an input and then execute the rest of the code as it is. Does this solve your problem?
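To make that concrete, here is a rough sketch of the domain-stripping step written as a function of the homepage, so it no longer depends on a hard-coded class attribute. The names `bare_domain` and `extractor_kwargs` are mine, not part of Scrapy, and I used `urlparse` instead of the chained `replace` calls, so it also lowercases the host and drops any port or path. Treat it as an assumption about what you need rather than a finished solution.

```python
from urllib.parse import urlparse


def bare_domain(homepage):
    """Host of `homepage` without scheme, port, leading 'www.' or trailing slash."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    url = homepage if "://" in homepage else "http://" + homepage
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def extractor_kwargs(homepage):
    """Keyword arguments for the internal and external LinkExtractor of one site."""
    domain = bare_domain(homepage)
    internal = {"allow_domains": (domain,)}  # links that stay on the site
    external = {"deny_domains": (domain,)}   # links that leave the site
    return internal, external


if __name__ == "__main__":
    print(extractor_kwargs("http://www.somesite.com/"))
```

In the spider you would take `homepage` as a keyword argument of `__init__`, set `self.start_urls` and build `self.rules` from these two dicts, and only then call `super().__init__()`, because CrawlSpider compiles its rules in its own `__init__`. The value can then be supplied with `scrapy crawl homepages -a homepage=http://www.somesite.com`.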
Parth Verma

On Thursday, 2 March 2017 20:18:34 UTC+5:30, Simon Nizov wrote:
>
> Hi!
>
> I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over
> their internal links, and scrape the contents of any external links (links
> with a domain different from the original domain).
>
> I managed to do that with 2 rules but they are based on the domain of the
> site being crawled. If I want to run this on multiple websites I run into a
> problem because I don't know which "start_url" I'm currently on so I can't
> change the rule appropriately.
>
> Here's what I came up with so far, it works for one website and I'm not
> sure how to apply it to a list of websites:
>
> class HomepagesSpider(CrawlSpider):
>     name = 'homepages'
>
>     homepage = 'http://www.somesite.com'
>
>     start_urls = [homepage]
>
>     # strip http and www
>     domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
>     domain = domain[:-1] if domain[-1] == '/' else domain
>
>     rules = (
>         Rule(LinkExtractor(allow_domains=(domain), deny_domains=()),
>              callback='parse_internal', follow=True),
>         Rule(LinkExtractor(allow_domains=(), deny_domains=(domain)),
>              callback='parse_external', follow=False),
>     )
>
>     def parse_internal(self, response):
>         # log internal page...
>
>     def parse_external(self, response):
>         # parse external page...
>
> This can probably be done by just passing the start_url as an argument
> when calling the scraper, but I'm looking for a way to do that
> programmatically within the scraper itself.
>
> Any ideas? Thanks!
>
> Simon.