Let me try to clarify. The code I posted works perfectly for one website (its homepage): it sets two rules based on that homepage. If I want to run it on multiple sites, I would normally just add them to start_urls. But from the second URL onwards the rules are no longer effective, because they still reference the first start_url (the first homepage). I want to be able to change the rules dynamically, based on which start_url is currently being crawled. Does that make sense? Thanks!

Simon.
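A minimal sketch of one way to do this, assuming Python 3 and a Scrapy version where CrawlSpider compiles its rules during __init__ (true of current releases): accept the homepage as a spider argument (e.g. scrapy crawl homepages -a homepage=http://www.somesite.com) and build per-instance rules before calling super().__init__(). The callback bodies are placeholders:

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HomepagesSpider(CrawlSpider):
    name = 'homepages'

    def __init__(self, homepage, *args, **kwargs):
        # Derive the domain from whichever homepage this instance was given.
        domain = urlparse(homepage).netloc
        domain = domain[4:] if domain.startswith('www.') else domain

        self.start_urls = [homepage]
        # CrawlSpider compiles self.rules inside __init__, so the rules
        # must be in place before super().__init__() runs.
        self.rules = (
            Rule(LinkExtractor(allow_domains=(domain,)),
                 callback='parse_internal', follow=True),
            Rule(LinkExtractor(deny_domains=(domain,)),
                 callback='parse_external', follow=False),
        )
        super().__init__(*args, **kwargs)

    def parse_internal(self, response):
        pass  # log internal page...

    def parse_external(self, response):
        pass  # parse external page...

Because each spider instance now derives its own rules, adding more sites becomes a matter of starting more crawls rather than extending start_urls.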
On Friday, 3 March 2017 20:06:57 UTC+2, Parth Verma wrote:
>
> Hi,
>
> If I understood your problem, a simple approach could be just accepting
> the site `homepage` and then executing the rest of the code.
> Does this solve your problem?
>
> Parth Verma
>
> On Thursday, 2 March 2017 20:18:34 UTC+5:30, Simon Nizov wrote:
>>
>> Hi!
>>
>> I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go
>> over their internal links, and scrape the contents of any external links
>> (links with a domain different from the original domain).
>>
>> I managed to do that with two rules, but they are based on the domain of
>> the site being crawled. If I want to run this on multiple websites I run
>> into a problem, because I don't know which start_url I'm currently on, so
>> I can't adjust the rules accordingly.
>>
>> Here's what I came up with so far. It works for one website, and I'm not
>> sure how to apply it to a list of websites:
>>
>> from scrapy.linkextractors import LinkExtractor
>> from scrapy.spiders import CrawlSpider, Rule
>>
>> class HomepagesSpider(CrawlSpider):
>>     name = 'homepages'
>>
>>     homepage = 'http://www.somesite.com'
>>     start_urls = [homepage]
>>
>>     # strip the scheme and the www prefix
>>     domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
>>     domain = domain[:-1] if domain[-1] == '/' else domain
>>
>>     rules = (
>>         Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()),
>>              callback='parse_internal', follow=True),
>>         Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)),
>>              callback='parse_external', follow=False),
>>     )
>>
>>     def parse_internal(self, response):
>>         pass  # log internal page...
>>
>>     def parse_external(self, response):
>>         pass  # parse external page...
>>
>> This can probably be done by just passing the start_url as an argument
>> when calling the scraper, but I'm looking for a way to do that
>> programmatically within the scraper itself.
>>
>> Any ideas? Thanks!
>>
>> Simon.
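Building on the per-instance rules in the sketch above, here is a hedged sketch of driving several homepages programmatically from one script with CrawlerProcess (the homepages list below is made up); each crawl gets its own spider instance and therefore its own rules:

from scrapy.crawler import CrawlerProcess

# Hypothetical homepage list; each crawl builds its own rules in __init__.
homepages = [
    'http://www.site-one.com',
    'http://www.site-two.com',
]

process = CrawlerProcess()
for homepage in homepages:
    process.crawl(HomepagesSpider, homepage=homepage)
process.start()  # blocks until every scheduled crawl has finished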