All,
   Just writing a note in case this helps anyone else. I managed to get it 
working with the following code; the problem must have been my Rules with an 
empty allow(), which were not working:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'linux.com'
    allowed_domains = ['linux.com']
    start_urls = ['http://www.linux.com']

    # one rule is enough: allow=('.+',) matches every extracted link,
    # whereas my earlier empty allow() was the part that did not work
    rules = [
        Rule(SgmlLinkExtractor(allow=('.+',)), follow=True,
             callback='parse_item', process_links='process_links'),
    ]

    def process_links(self, links):
        spiderList = []
        for link in links:
            print 'Testing link: ', link.url
            # modify the link however you like here...
            spiderList.append(link)
        return spiderList

    def parse_item(self, response):
        # minimal callback so callback='parse_item' resolves;
        # put your per-page parsing here
        pass
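
For the URL modification Paul described, process_links is also the place to 
rewrite each extracted link before Scrapy schedules it: the links argument is 
a list of Link objects whose url attribute can be changed in place. A minimal 
sketch, dropping in for the process_links above (the https rewrite is just a 
made-up example of a modification):

    def process_links(self, links):
        modified = []
        for link in links:
            # hypothetical modification: force https on every extracted URL
            if link.url.startswith('http://'):
                link.url = 'https://' + link.url[len('http://'):]
            modified.append(link)
        return modified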


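And for the larger goal of handing a list of crawled URLs to the rest of your 
Python code, one simple option is to yield an item per crawled page and let a 
feed export write them out. A rough, self-contained sketch (UrlItem, 
UrlListSpider and the field name are made up; check the feed-export options 
for your Scrapy version):

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class UrlItem(Item):
    # one field holding the URL of a crawled page
    url = Field()


class UrlListSpider(CrawlSpider):
    name = 'urllist'
    allowed_domains = ['linux.com']
    start_urls = ['http://www.linux.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+',)), follow=True,
             callback='parse_item'),
    ]

    def parse_item(self, response):
        # one item per crawled page; the feed exporter collects them
        item = UrlItem()
        item['url'] = response.url
        yield item

Running it with something like "scrapy runspider test.py -o urls.json" should 
leave you with a JSON file of URLs that the rest of your code can load.
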
On Sunday, 16 March 2014 17:54:53 UTC+11, Paul P wrote:
>
> Hello All,
>   I have been reading the scrapy documentation and mailing lists but 
> cannot find an example which works. I don't find the documentation too 
> helpful for using process_links().
>
>   All I need to do is analyse each URL as it is processed and make a 
> modification to it (in certain circumstances) before passing it back to 
> scrapy for spidering.
>
>   As a test, I would just like to print out each URL as it is being 
> processed, but I cannot even get that to work. Example code is below, which 
> I am running with "scrapy runspider test.py" -- or should I be calling it 
> differently? My goal is to create a list of URLs which can be passed to the 
> rest of my Python code for analysis.
>
> from scrapy.item import Item
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import Selector
>
> class Demo(CrawlSpider):
>         name = ['www.linux.com']
>         allowed_domains = 'www.linux.com'
>         start_urls = ['http://www.linux.com']
>
>
>         rules = (
>                 Rule(SgmlLinkExtractor(allow=('')), 
> process_links='process_links', follow=True),
>         )
>
> def process_links(self,links):
>         for link in links:
>                 print 'link: ', link  #I just want to print out each URL 
> as it is processed for now
>         return links
>
>
> Thank you!
> Paul.
>
