Nikolaos,
Perfect! The RefererMiddleware was just what I was looking for (I only
needed to capture the referring URL, not the entire breadcrumb trail).
It took me a bit of reading through posts to figure out how to actually
retrieve the referring URL, so the basics are below:
Add to your settings file:
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}
Then, in your spider's parse callback, use the following to access the
referring URL:

    response.request.headers.get('Referer', None)

(Note that 'Referer', with a single "r", is the correct spelling here;
'Referrer' will not work.)
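For anyone finding this thread later, here is a minimal sketch of how the
two pieces fit together in a crawl spider. The spider name, the item
fields, and the parse_item callback are made up for illustration; the
import paths match the scrapy.contrib layout used above.

    from scrapy.item import Item, Field
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class PageItem(Item):
        url = Field()
        originating_url = Field()

    class ReferrerExampleSpider(CrawlSpider):
        # Hypothetical spider, only to illustrate reading the Referer header.
        name = 'referrer_example'
        start_urls = ['http://www.A.com']

        rules = (
            # Follow every extracted link and hand the response to parse_item.
            Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            item = PageItem()
            item['url'] = response.url
            # 'Referer' (single r) is set by the RefererMiddleware enabled in
            # settings; .get() falls back to None if the header is missing.
            item['originating_url'] = response.request.headers.get('Referer', None)
            return item

This only looks back one hop; if you ever did need the whole trail, you
could carry it forward yourself in request.meta, but the Referer header
was all I needed here.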
Thanks again!
On Monday, February 10, 2014 3:00:10 PM UTC-5, Michael Pastore wrote:
>
> I am writing a crawling spider, but for each URL visited and parsed, the
> saved item needs to include the originating URL.
>
> For example, let's say that given start_urls = ["http://www.A.com"], the
> initial list of URLs extracted by the SgmlLinkExtractor to follow is
> ["http://www.B.com", "http://www.C.com"]; the spider engine would then
> schedule visits to www.B.com and www.C.com. When the spider crawls
> www.B.com and the parse method extracts some data, I need the processed
> item to include a field with the originating URL, which in this case is
> www.A.com.
>
> Like a breadcrumb trail, for each call to the parse method I need to look
> back one step. Is there an existing way to get this information?
>
> Much thanks
>