Hi Michael,

"process_links" takes a list of scrapy.link.Link objects and is expected to 
return a list of scrapy.link.Link objects
(see scrapy.link.Link class definition at 
https://github.com/scrapy/scrapy/blob/master/scrapy/link.py#L8
and how CrawlSpider uses process_links 
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L53
)
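
A minimal sketch of that contract, just to show the shape (the function name
and what you do inside are up to you):

    def process_links(links):
        # "links" is a list of scrapy.link.Link objects; each one has
        # .url, .text, .fragment and .nofollow attributes.
        # Return the (possibly filtered or modified) list of links you
        # actually want the spider to follow.
        return links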

You can look for inspiration in SgmlLinkExtractor, specifically the
_link_allowed method, which works on one link at a time
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py#L137
and how it uses "deny_domains".

you can define "process_links" as an independent method or a spider method 
(which is probably what you want, to hold the domain to filter out).
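
As a side note: if the domains you want to skip are fixed and known up front,
the extractor's "deny_domains" argument already does that filtering for you,
without needing a process_links hook (sketch only, reusing the regex and the
dmoz.org domain from your example):

    SgmlLinkExtractor(
        allow=("((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)", ),
        deny_domains=('dmoz.org', ),
    )

But since you want to drop links matching the domain of the request URL,
process_links gives you more flexibility.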

Try something like:

from urlparse import urlparse
from scrapy.utils.url import url_is_from_any_domain

class MySpider(CrawlSpider):
    ...
    filtered_domains = ['dmoz.org']
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=("((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)", ),
            ),
            process_links='link_filtering',
            callback="parse_items",
            follow=True,
        ),
    )

    def link_filtering(self, links):
        # Drop any link whose domain is in filtered_domains; return the
        # rest so the spider only follows external links.
        ret = []
        for link in links:
            parsed_url = urlparse(link.url)
            if not url_is_from_any_domain(parsed_url, self.filtered_domains):
                ret.append(link)
        return ret
    ...
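
The loop above can also be written as a one-line filter if you prefer;
url_is_from_any_domain also accepts the URL string directly (it parses it
internally), so the explicit urlparse call is optional:

    def link_filtering(self, links):
        # equivalent filter as a list comprehension
        return [link for link in links
                if not url_is_from_any_domain(link.url, self.filtered_domains)]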

Hope this helps.

/Paul.

On Wednesday, January 8, 2014 12:45:03 AM UTC+1, Michael Pastore wrote:

> The spider I am building needs to crawl sites but only follow urls to 
> external sites.  
>
> The rule I am using is as follows:
>
>     Rule(SgmlLinkExtractor(allow=(
>         "((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)", ),),
>         callback="parse_items", follow=True),
>
> and the regex does work to filter out and return a list of only those urls 
> that begin with mailto, news, http, etc.  But, I need to be able to remove 
> any fully qualified links with the same domain as the request url.  For 
> example, if the current request url is www.dmoz.org, the spider must not 
> follow any links that have the domain dmoz.org in them. 
>
> I would like to use process_links
> <http://doc.scrapy.org/en/0.20/topics/spiders.html?highlight=process_links#scrapy.contrib.spiders.Rule>,
> which is defined as a rule parameter for filtering links, but I have been
> unable to find an example of such a method in action. Primarily, what is a
> method signature that works here, and what parameters do I need to pass in?
>
> What would a method look like to which I can assign the process_links 
> parameter in the rule?  I already have the code to filter the unwanted 
> links, I just need to get it into the rule.
>
> Much thanks.
>
> ** full disclosure:  I am learning python and simultaneously trying to 
> unlearn years of static language programming
>
