If you only need the parent (one level up the tree), see the Referer spider middleware: http://doc.scrapy.org/en/latest/topics/spider-middleware.html?highlight=referer#module-scrapy.contrib.spidermiddleware.referer
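As a rough sketch of the parent-only case: the Referer middleware sets a Referer header on each request generated from a response, so a callback can read it back from response.request.headers. The helper below is plain Python so it can be tried without Scrapy; the name parent_url and the item field "origin_url" are made up for illustration, and it assumes the middleware is enabled (which I believe is the default).

```python
def parent_url(headers):
    """Return the Referer header as text, or None if it is absent.

    `headers` is any mapping with a .get() method, e.g. the headers of
    the request that produced a response. Header values may arrive as
    bytes, so they are decoded to str here.
    """
    referer = headers.get("Referer")
    if referer is None:
        return None  # start URLs have no parent
    if isinstance(referer, bytes):
        return referer.decode("utf-8", errors="replace")
    return referer


# Hypothetical use inside a spider callback:
#     item["origin_url"] = parent_url(response.request.headers)
```

For the example in the question, an item parsed from www.B.com would then carry http://www.A.com as its originating URL, and items from the start URL would carry None.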
If you need the whole path, look at Request.meta: http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Request.meta

You could append each "originating URL" to a list in meta and so keep the whole path. For how redirects are recorded, see the redirect middleware: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=redirect#module-scrapy.contrib.downloadermiddleware.redirect

For very large trees, however, I would rather store the paths in the spider while traversing and use the Referer header (and maybe meta['redirect_urls']) to fill in the rest of the path. You may want to dump such a tree anyway, to get a glimpse of the crawl during testing.

On Monday, 10 February 2014 22:00:10 UTC+2, Michael Pastore wrote:
>
> I am writing a crawling spider, but for each URL visited and parsed, the
> saved item needs to include the originating URL.
>
> For example, let's say that given start_urls = ["http://www.A.com"], the
> initial list of URLs to follow extracted by the SgmlLinkExtractor is
> ["http://www.B.com", "http://www.C.com"]. The spider engine would then
> schedule a visit to www.B.com, then www.C.com. When the spider crawls
> www.B.com and the parse method extracts some data, I need the processed
> item to include a field with the originating URL, which in this case is
> www.A.com.
>
> Like a breadcrumb trail, for each call to the parse method I need to look
> back one step. Is there an existing way to get this information?
>
> Much thanks
>
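For the whole-path approach in the reply above, a minimal sketch of the list-in-meta idea follows. It is plain Python so it runs without Scrapy. The function name extend_path and the meta key "path" are made up for this example; "redirect_urls" is the key the redirect middleware fills in.

```python
def extend_path(meta, current_url):
    """Return the breadcrumb path to hand to requests spawned from current_url.

    `meta` is the meta dict of the response being parsed. The result is
    the ancestors recorded so far, plus any URLs that redirected on the
    way to this page, plus the page itself.
    """
    path = list(meta.get("path", []))            # copy: sibling requests must not share a list
    path.extend(meta.get("redirect_urls", []))   # intermediate hops, if any were redirected
    path.append(current_url)
    return path


# Hypothetical use inside a spider callback:
#     path = extend_path(response.meta, response.url)
#     yield scrapy.Request(link, callback=self.parse, meta={"path": path})
```

With this in place, response.meta["path"] in a callback lists the ancestors of the current page in order, and its last element is the originating URL. Each request carries its own copy of the list, which is why it can get expensive on very large trees.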
