All,

I am working on getting a simple example going where I can spider a site and evaluate each link as I go, modifying it slightly if required. To begin with, I am happy just printing out each link as it is crawled so I can see what is going on; my sample code is below.

I am running this with "scrapy runspider test.py", though this may be incorrect. My end goal is to have a list of spidered URLs that I can hand off to the rest of my Python program for analysis (one way to do this is sketched after the sample code below).

I am not seeing any output with my current method. Any pointers would be much appreciated!
Thank you
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class Test(CrawlSpider):
    name = 'test'                        # must be a string, not a list
    allowed_domains = ['www.linux.com']  # must be a list of domains
    start_urls = ['http://www.linux.com']

    rules = (
        Rule(SgmlLinkExtractor(), process_links='process_links', follow=True),
    )

    def process_links(self, links):
        # just print out each URL as it is processed for now
        for link in links:
            print 'link:', link
        return links
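
For the end goal (a list of crawled URLs handed off for analysis), one option is to accumulate the URLs on the spider and pick the list up when the crawl finishes, using Scrapy's closed() shortcut for the spider_closed signal. A minimal sketch under that assumption; analyse() is a hypothetical stand-in for the rest of the program:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class Test(CrawlSpider):
    name = 'test'
    allowed_domains = ['www.linux.com']
    start_urls = ['http://www.linux.com']

    rules = (
        Rule(SgmlLinkExtractor(), process_links='process_links', follow=True),
    )

    collected_urls = []  # accumulates every URL seen by the link extractor

    def process_links(self, links):
        # record each extracted URL before the links are scheduled
        for link in links:
            self.collected_urls.append(link.url)
        return links

    def closed(self, reason):
        # called once when the crawl finishes (shortcut for the
        # spider_closed signal); hand the list off here
        analyse(self.collected_urls)  # hypothetical downstream function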