Let me try to clarify.
The code I posted works perfectly for one website (its homepage). It sets 
two rules based on that homepage.
If I want to run it on multiple sites, I would normally just add them to 
start_urls. But then, starting with the second URL, the rules are no longer 
effective, because they still reference the first start URL (the first 
site's homepage). I want to be able to change the rules dynamically, based 
on the start_urls entry currently being crawled.
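
In other words, I'd like each crawl to build its rules from its own 
homepage. Something like this is the effect I'm after (a rough, untested 
sketch of the idea, not working code: the `homepages` list is made up, and 
it assumes the spider could build its rules per instance from a `homepage` 
argument, as in the snippet further down the thread):

from scrapy.crawler import CrawlerProcess

# hypothetical list of homepages to crawl
homepages = ['http://www.site-a.com', 'http://www.site-b.com']

process = CrawlerProcess()
for homepage in homepages:
    # each crawl would build its own rules from this homepage
    process.crawl(HomepagesSpider, homepage=homepage)
process.start()  # blocks until all queued crawls finish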

Does that make sense?

Thanks!
Simon.



On Friday, 3 March 2017 20:06:57 UTC+2, Parth Verma wrote:
>
> Hi,
>
> If I understood your problem correctly, a simple approach could be to just 
> accept the site's `homepage` as a spider argument and then execute the rest 
> of the code.
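>
> For example, something along these lines (an untested sketch; it relies on 
> CrawlSpider compiling `self.rules` in its `__init__`, and `homepage` would 
> be passed as a spider argument, e.g. -a homepage=http://www.somesite.com):
>
> from urllib.parse import urlparse
>
> from scrapy.spiders import CrawlSpider, Rule
> from scrapy.linkextractors import LinkExtractor
>
>
> class HomepagesSpider(CrawlSpider):
>     name = 'homepages'
>
>     def __init__(self, homepage=None, *args, **kwargs):
>         self.start_urls = [homepage]
>         domain = urlparse(homepage).netloc.replace('www.', '')
>         # instance-level rules, set before CrawlSpider compiles them
>         self.rules = (
>             Rule(LinkExtractor(allow_domains=(domain,)),
>                  callback='parse_internal', follow=True),
>             Rule(LinkExtractor(deny_domains=(domain,)),
>                  callback='parse_external', follow=False),
>         )
>         super(HomepagesSpider, self).__init__(*args, **kwargs)
>
>     # parse_internal / parse_external as in your original spider
>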
> Does this solve your problem?
>
> Parth Verma
> On Thursday, 2 March 2017 20:18:34 UTC+5:30, Simon Nizov wrote:
>>
>> Hi!
>>
>> I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go 
>> over their internal links, and scrape the contents of any external links 
>> (links with a domain different from the original domain).
>>
>> I managed to do that with two rules, but they are based on the domain of 
>> the site being crawled. If I want to run this on multiple websites, I run 
>> into a problem: I don't know which start_url I'm currently on, so I can't 
>> adjust the rules appropriately.
>>
>> Here's what I came up with so far. It works for one website, but I'm not 
>> sure how to apply it to a list of websites:
>> from scrapy.spiders import CrawlSpider, Rule
>> from scrapy.linkextractors import LinkExtractor
>>
>>
>> class HomepagesSpider(CrawlSpider):
>>     name = 'homepages'
>>
>>     homepage = 'http://www.somesite.com'
>>     start_urls = [homepage]
>>
>>     # strip the scheme and "www." to get the bare domain
>>     domain = homepage.replace('http://', '').replace('https://', '')
>>     domain = domain.replace('www.', '')
>>     domain = domain[:-1] if domain[-1] == '/' else domain
>>
>>     rules = (
>>         Rule(LinkExtractor(allow_domains=(domain,)),
>>              callback='parse_internal', follow=True),
>>         Rule(LinkExtractor(deny_domains=(domain,)),
>>              callback='parse_external', follow=False),
>>     )
>>
>>     def parse_internal(self, response):
>>         pass  # log internal page...
>>
>>     def parse_external(self, response):
>>         pass  # parse external page...
>>
>>
>> This can probably be done by just passing the start_url as an argument 
>> when calling the scraper, but I'm looking for a way to do that 
>> programmatically within the scraper itself.
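>>
>> One direction I've toyed with is dropping the class-level rules entirely 
>> and classifying each link in the callback instead, carrying the seed 
>> domain along in request.meta (a rough, untested sketch; `_host` is a 
>> made-up helper):
>>
>> from urllib.parse import urlparse
>>
>> import scrapy
>> from scrapy.linkextractors import LinkExtractor
>>
>>
>> def _host(url):
>>     # hostname with any leading "www." removed
>>     host = urlparse(url).netloc.lower()
>>     return host[4:] if host.startswith('www.') else host
>>
>>
>> class HomepagesSpider(scrapy.Spider):
>>     name = 'homepages'
>>     start_urls = ['http://www.site-a.com', 'http://www.site-b.com']
>>
>>     def parse(self, response):
>>         # start URLs seed themselves; followed links inherit the seed
>>         seed = response.meta.get('seed', _host(response.url))
>>         for link in LinkExtractor().extract_links(response):
>>             if _host(link.url) == seed:
>>                 # internal link: keep crawling under the same seed
>>                 yield scrapy.Request(link.url, callback=self.parse,
>>                                      meta={'seed': seed})
>>             else:
>>                 yield scrapy.Request(link.url,
>>                                      callback=self.parse_external)
>>
>>     def parse_external(self, response):
>>         pass  # parse external page...
>>
>> But maybe there's a cleaner way that keeps the CrawlSpider rules.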
>>
>>
>> Any ideas? Thanks!
>>
>> Simon.
>>
>>
