The spider I am building needs to crawl sites but only follow URLs that 
point to external sites.

The rule I am using is as follows:

    Rule(
        SgmlLinkExtractor(
            allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",),
        ),
        callback="parse_items",
        follow=True,
    ),

and the regex does work to filter out and return a list of only those URLs 
that begin with mailto, news, http, etc.  But I also need to remove any 
fully qualified links whose domain matches the domain of the request URL.  
For example, if the current request URL is www.dmoz.org, the spider must 
not follow any links that have the domain dmoz.org in them.
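To illustrate the kind of check I have in mind, here is a rough sketch 
using urlparse, with dmoz.org hard-coded as in the example above (the 
function name is just a placeholder):

    from urlparse import urlparse  # Python 2, which Scrapy 0.20 runs on

    def is_external(link_url, page_url):
        # Treat the link as external when its hostname does not contain
        # the hostname of the page it was found on, e.g. anything
        # containing "dmoz.org" is internal while crawling www.dmoz.org.
        page_domain = urlparse(page_url).netloc.replace("www.", "", 1)
        return page_domain not in urlparse(link_url).netloc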

I would like to use the process_links parameter 
(http://doc.scrapy.org/en/0.20/topics/spiders.html?highlight=process_links#scrapy.contrib.spiders.Rule), 
which is defined as a Rule parameter for filtering links, but I have been 
unable to find an example of such a method in action.  Primarily, what is a 
method signature that works here, and what parameters do I need to pass in?

What would a method that I can assign to the process_links parameter in 
the rule look like?  I already have the code to filter the unwanted links; 
I just need to get it into the rule.
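Something like the following is what I am imagining, assuming process_links 
can be given the name of a spider method that receives the list of 
extracted Link objects and returns the filtered list (the class, spider, 
and method names below are placeholders I made up, and I am not sure the 
signature is right):

    from urlparse import urlparse

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class ExternalLinksSpider(CrawlSpider):
        name = "external_links"
        start_urls = ["http://www.dmoz.org/"]

        rules = (
            Rule(
                SgmlLinkExtractor(
                    allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",),
                ),
                callback="parse_items",
                process_links="filter_links",
                follow=True,
            ),
        )

        def filter_links(self, links):
            # My guess: "links" is the list of Link objects the extractor
            # produced for a response, and whatever list I return here is
            # what actually gets followed.
            return [link for link in links
                    if "dmoz.org" not in urlparse(link.url).netloc]

        def parse_items(self, response):
            # my existing item extraction goes here
            pass

I also can't tell from the docs whether such a method has access to the 
response the links came from, or only the links themselves.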

Much thanks.

** Full disclosure: I am learning Python and simultaneously trying to 
unlearn years of static-language programming.
