Nikolaos,

Perfect! The Referer Middleware was just what I was looking for (I only 
needed to capture the referring url and not the entire breadcrumb trail).

It took me a bit of reading through posts to figure out how to actually 
retrieve the referring url, and the basics are below:

Add to your settings file:

SPIDER_MIDDLEWARES = {
    # The value is the middleware's order, not an on/off flag (700 is its default position)
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
}


Then in your spider parser use the following to access the referring url:

# btw: 'Referer' is the correct usage, 'Referrer' will not work
response.request.headers.get('Referer', None)
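
In case it helps anyone else, here is a rough sketch of how that lookup could feed an item field. The names referer_of, parse_page and origin_url are made up for illustration and are not part of Scrapy. The stand-in response object is only there so the snippet runs without a crawl. The bytes check assumes that newer Scrapy versions return header values as bytes:

```python
from types import SimpleNamespace


def referer_of(response, default=None):
    """Return the Referer header of the request that produced `response`."""
    value = response.request.headers.get('Referer', default)
    # Assumption: newer Scrapy versions return header values as bytes.
    if isinstance(value, bytes):
        value = value.decode('utf-8')
    return value


def parse_page(response):
    """Hypothetical callback: build an item that records the originating url."""
    return {
        'url': response.url,
        'origin_url': referer_of(response),
    }


# Stand-in for a Scrapy response, only to exercise the helper without a crawl.
fake_response = SimpleNamespace(
    url='http://www.B.com',
    request=SimpleNamespace(headers={'Referer': b'http://www.A.com'}),
)
item = parse_page(fake_response)
```

Requests generated from start_urls have no Referer header, so for those pages the lookup returns the default (None here).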

Thanks again!

On Monday, February 10, 2014 3:00:10 PM UTC-5, Michael Pastore wrote:
>
> I am writing a crawling spider but for each url visited and parsed, the 
> saved item needs to include the originating url.  
>
> For example, let's say start_urls = ["http://www.A.com"] and the initial
> list of urls to follow, as extracted by the SgmlLinkExtractor, is
> ["http://www.B.com", "http://www.C.com"]. The spider engine would then
> schedule a visit to www.B.com and then www.C.com. When the spider crawls
> to www.B.com and the parse method extracts some data, I need the
> processed item to include a field with the originating url, which in this
> case is www.A.com.
>
> Like a breadcrumb trail, for each call to the parse method I need to look
> back one step. Is there an existing way to get this information?
>
> Much thanks
>
