If you only care about the parent (one level up the tree), the Referer 
header set by the referer spider middleware is enough:
http://doc.scrapy.org/en/latest/topics/spider-middleware.html?highlight=referer#module-scrapy.contrib.spidermiddleware.referer
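
A minimal sketch of reading that header, assuming the referer middleware is 
enabled. The helper name parent_url is made up for illustration, and it only 
relies on the headers object having a dict-like get(). Depending on the Scrapy 
version the value may come back as bytes, so the helper decodes it.

```python
def parent_url(headers):
    """Return the Referer value from a dict-like headers object, or None."""
    value = headers.get("Referer")
    if isinstance(value, bytes):
        # Header values may be stored as bytes; decode them for use in items.
        value = value.decode("utf-8", errors="replace")
    return value
```

In a callback this would be called as parent_url(response.request.headers) 
and the result stored in whatever item field you use for the originating url. 
It returns None for the start urls, since nothing referred to them.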

If you need the whole path, look at Request.meta: 
http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Request.meta
You could append each originating url to a list in meta and so keep the 
whole path. The redirect middleware does something similar for redirects: 
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=redirect#module-scrapy.contrib.downloadermiddleware.redirect
For very large trees, however, I would rather store the tree in the spider 
while traversing and use the Referer header (and maybe meta['redirect_urls']) 
to fill in the rest of the path. You may want to dump such a tree anyway, to 
get a glimpse of the crawl during testing.
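
Both ideas can be sketched in plain Python. Everything named here is an 
assumption for illustration: the meta key "path", the helpers extend_path, 
record_parent and path_to, and the parents dict that the spider would keep as 
an attribute. None of it is Scrapy API.

```python
def extend_path(meta, url, key="path"):
    """Return a new list: the path stored under meta[key], plus url.

    A copy is made so that sibling requests do not share one list.
    """
    return list(meta.get(key, [])) + [url]


def record_parent(parents, child_url, parent):
    """Remember the first parent seen for child_url (spider-side tree)."""
    parents.setdefault(child_url, parent)


def path_to(parents, url):
    """Walk the parent links back from url and return the path root-first."""
    path = [url]
    seen = {url}
    while parents.get(path[-1]) is not None:
        parent = parents[path[-1]]
        if parent in seen:
            # Pages that link to each other would otherwise loop forever.
            break
        seen.add(parent)
        path.append(parent)
    path.reverse()
    return path
```

With the meta approach, each new request would carry 
meta={"path": extend_path(response.meta, response.url)}, so the callback finds 
the full trail in response.meta["path"]. With the spider-side approach, only 
one parent per url is stored and the trail is rebuilt on demand with path_to, 
which keeps the per-request data small on large crawls.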

On Monday, 10 February 2014 22:00:10 UTC+2, Michael Pastore wrote:
>
> I am writing a crawling spider but for each url visited and parsed, the 
> saved item needs to include the originating url.  
>
> For example, let's say start_urls = ["http://www.A.com"] and the initial 
> list of urls to follow, as extracted by the SgmlLinkExtractor, is 
> ["http://www.B.com", "http://www.C.com"]. The spider engine would then 
> schedule a visit to www.B.com, then www.C.com.  When the spider crawls 
> to www.B.com and the parse method extracts some data, I need the 
> processed item to include a field with the originating url, which in this 
> case is www.A.com.  
>
> Like a breadcrumb trail, for each call to the parse method I need to look 
> back one step. Is there an existing way to get this information? 
>
> Much thanks
>
