I am writing a crawling spider, but for each URL visited and parsed, the saved item needs to include the originating URL.
For example, let's say start_urls = ["http://www.A.com"] and the initial list of URLs to follow, as extracted by the SgmlLinkExtractor, is ["http://www.B.com", "http://www.C.com"]. The spider engine would then schedule a visit to www.B.com and then www.C.com. When the spider crawls www.B.com and the parse method extracts some data, I need the processed item to include a field with the originating URL, which in this case is www.A.com. Like a breadcrumb trail, for each call to the parse method I need to look back one step. Is there an existing way to get this information?

Much thanks
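
To make the behaviour I'm after concrete, here is a rough sketch in plain Python. It does not use Scrapy at all: the LINKS graph, the crawl function and the "origin" field name are made up for illustration, and the real spider would get its links from the link extractor instead.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the real pages.
LINKS = {
    "http://www.A.com": ["http://www.B.com", "http://www.C.com"],
    "http://www.B.com": [],
    "http://www.C.com": [],
}


def crawl(start_urls):
    """Breadth-first crawl that records the URL each page was reached from."""
    # Each queue entry is (url to visit, originating url); start URLs have no origin.
    queue = deque((url, None) for url in start_urls)
    seen = set(start_urls)
    items = []
    while queue:
        url, origin = queue.popleft()
        # The "parse" step: the saved item carries the originating URL.
        items.append({"url": url, "origin": origin})
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                # Remember which page this link was found on (one step back).
                queue.append((link, url))
    return items


items = crawl(["http://www.A.com"])
for item in items:
    print(item)
```

So the item for www.B.com ends up with www.A.com as its origin, and the start URL itself has no origin.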
