The spider I am building needs to crawl sites but only follow URLs that
point to external sites.
The rule I am using is as follows:
Rule(SgmlLinkExtractor(allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",)),
     callback="parse_items", follow=True),
and the regex works as intended: it returns only those URLs that begin with
mailto, news, http, and so on. But I also need to remove any fully qualified
links whose domain matches the domain of the current request URL. For
example, if the current request URL is www.dmoz.org, the spider must not
follow any links on the dmoz.org domain.
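
For reference, the check I already have is roughly this (simplified; it
treats www.dmoz.org and dmoz.org as the same site):

from urlparse import urlparse

def _domain(url):
    # Normalize "www.dmoz.org" and "dmoz.org" to the same value.
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

def is_same_site(url, request_url):
    return _domain(url) == _domain(request_url)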
I would like to use the process_links parameter
(http://doc.scrapy.org/en/0.20/topics/spiders.html?highlight=process_links#scrapy.contrib.spiders.Rule),
which Rule accepts for filtering links, but I have been unable to find an
example of such a method in action. Primarily: what method signature works
here, and what parameters do I need to pass in? What would a method look
like that I could assign to the process_links parameter in the rule? I
already have the code to filter the unwanted links; I just need to wire it
into the rule. A sketch of what I am imagining follows.
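
In case it clarifies the question, here is roughly what I am picturing. The
filter_links name is mine, and I am only guessing that the method receives
the list of extracted links; since I don't know whether it can see the
current response, this version compares against the start_urls domain rather
than the per-request domain:

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ExternalLinkSpider(CrawlSpider):
    name = "external_links"
    start_urls = ["http://www.dmoz.org/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)",)),
             callback="parse_items",
             follow=True,
             # My guess: pass the name of a spider method that filters links.
             process_links="filter_links"),
    )

    def filter_links(self, links):
        # Assumption: 'links' is the list of Link objects the extractor
        # produced; keep only those pointing away from the start domain.
        # (A real version would normalize "www." as in the helper above.)
        start_domain = urlparse(self.start_urls[0]).netloc
        return [link for link in links
                if urlparse(link.url).netloc != start_domain]

    def parse_items(self, response):
        pass  # my existing item-extraction code goes here

Is that roughly the right shape, or does process_links receive something
other than the list of links?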
Much thanks.
** full disclosure: I am learning Python and simultaneously trying to
unlearn years of static-language programming