Hi Michael,

"process_links" takes a list of scrapy.link.Link objects and is expected to return a list of scrapy.link.Link objects (see the scrapy.link.Link class definition at https://github.com/scrapy/scrapy/blob/master/scrapy/link.py#L8 and how CrawlSpider uses process_links at https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L53).
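In other words, any callable that takes that list and returns a (possibly filtered) list will do. Here is a minimal self-contained sketch of the contract, just to show the shape of the data (the keep_offsite name and the URLs are made up for illustration):

    from scrapy.link import Link

    def keep_offsite(links):
        # each Link exposes .url, .text, .fragment and .nofollow
        return [link for link in links if 'dmoz.org' not in link.url]

    links = [Link('http://www.dmoz.org/a'), Link('http://example.com/b')]
    print(keep_offsite(links))  # only the example.com link survives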
You can look for inspiration in SgmlLinkExtractor, around the _link_allowed method that works on one link at a time (https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py#L137) and how it uses "deny_domains". You can define "process_links" as a standalone function or as a spider method (which is probably what you want, so it can hold the domains to filter out); when you pass its name as a string, CrawlSpider looks the method up on the spider instance while compiling the rules. Try something like:

    from urlparse import urlparse

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.utils.url import url_is_from_any_domain


    class MySpider(CrawlSpider):
        ...
        filtered_domains = ['dmoz.org']

        rules = (
            Rule(
                SgmlLinkExtractor(
                    allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",),
                ),
                process_links='link_filtering',
                callback="parse_items",
                follow=True,
            ),
        )

        def link_filtering(self, links):
            # keep only the links whose domain is not in filtered_domains
            ret = []
            for link in links:
                parsed_url = urlparse(link.url)
                if not url_is_from_any_domain(parsed_url, self.filtered_domains):
                    ret.append(link)
            return ret
        ...

Hope this helps.
/Paul.

On Wednesday, January 8, 2014 12:45:03 AM UTC+1, Michael Pastore wrote:
> The spider I am building needs to crawl sites but only follow URLs to
> external sites.
>
> The rule I am using is as follows:
>
>     Rule(SgmlLinkExtractor(allow=("((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",),),
>          callback="parse_items", follow=True),
>
> and the regex does work to filter out and return a list of only those URLs
> that begin with mailto, news, http, etc. But I need to be able to remove
> any fully qualified links with the same domain as the request URL. For
> example, if the current request URL is www.dmoz.org, the spider must not
> follow any links that have the domain dmoz.org in them.
>
> I would like to use process_links
> <http://doc.scrapy.org/en/0.20/topics/spiders.html?highlight=process_links#scrapy.contrib.spiders.Rule>,
> which is defined as a rule parameter for filtering links, but I have been
> unable to find an example of such a method in action. Primarily, what is a
> method signature that works here, and what parameters do I need to pass in?
>
> What would a method look like to which I can assign the process_links
> parameter in the rule? I already have the code to filter the unwanted
> links, I just need to get it into the rule.
>
> Much thanks.
>
> ** full disclosure: I am learning Python and simultaneously trying to
> unlearn years of static language programming
