Hi,

If I understood your problem correctly, a simple approach would be to accept 
the site `homepage` as a spider argument and then build the rules from it 
before executing the rest of the code. 
Does this solve your problem?

Parth Verma
On Thursday, 2 March 2017 20:18:34 UTC+5:30, Simon Nizov wrote:
>
> Hi!
>
> I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over 
> their internal links, and scrape the contents of any external links (links 
> with a domain different from the original domain).
>
> I managed to do that with two rules, but they are based on the domain of 
> the site being crawled. If I want to run this on multiple websites, I run 
> into a problem: I don't know which "start_url" I'm currently on, so I can't 
> adjust the rules appropriately.
>
> Here's what I came up with so far; it works for one website, but I'm not 
> sure how to apply it to a list of websites:
> class HomepagesSpider(CrawlSpider):
>     name = 'homepages'
>
>     homepage = 'http://www.somesite.com'
>
>     start_urls = [homepage]
>
>     # strip the scheme and www prefix
>     domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
>     domain = domain[:-1] if domain[-1] == '/' else domain
>
>     rules = (
>         Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()),
>              callback='parse_internal', follow=True),
>         Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)),
>              callback='parse_external', follow=False),
>     )
>
>     def parse_internal(self, response):
>         # log internal page...
>         pass
>
>     def parse_external(self, response):
>         # parse external page...
>         pass
>
> This can probably be done by passing the start_url as an argument when 
> calling the scraper, but I'm looking for a way to do that programmatically 
> within the scraper itself.
>
>
> Any ideas? Thanks!
>
> Simon.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.