You shouldn't return/yield the item until it's complete. In other words, you should yield the item from the "get_url_property" callback (via loader.load_item(), not "return loader"), not from the main parse callback. Each item must be yielded exactly once, and only once its data has been fully populated.
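Here's a minimal sketch of that flow in plain Python (no Scrapy, so the dicts and names below are stand-ins for Response/Request/meta, not real Scrapy API): parse() hands the half-filled item to the follow-up callback through meta, and only the callback yields it.

```python
# Plain-Python sketch of "yield the item only once it's complete".
# The dicts stand in for Scrapy's Response/Request objects; all keys
# ("url", "location_url", "location") are invented for illustration.

def parse(response, needs_follow_up):
    """Main callback: either finish the item here, or defer to a follow-up."""
    item = {"origin": response["url"]}
    if needs_follow_up:
        # Don't yield the half-filled item; carry it along in meta instead,
        # the way Scrapy's Request(meta=...) would.
        yield {"request": response["location_url"], "meta": {"item": item}}
    else:
        item["location"] = response["location"]
        yield item  # complete, so it's safe to yield here

def get_url_property(response, meta):
    """Follow-up callback: finish populating the item, then yield it once."""
    item = meta["item"]
    item["location"] = response["location"]
    yield item
```

In your actual spider the same shape applies: the else branch is the only place parse() should yield bl.load_item(), and get_url_property should end with "yield loader.load_item()" instead of "return loader".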
On Wed, Mar 5, 2014 at 11:47 AM, Joey Espinosa <jlouis.espin...@gmail.com> wrote:

> HOLY TYPOS. Sorry. Revised:
>
>     class SiteSpider(Spider):
>         site_loader = SiteLoader
>         ...
>
>         def parse(self, response):
>             item = Place()
>             sel = Selector(response)
>             bl = self.site_loader(item=item, selector=sel)
>             bl.add_value('domain', self.parent_domain)
>             bl.add_value('origin', response.url)
>             for place_property in item.fields:
>                 parse_xpath = self.template.get(place_property)
>                 # parse_xpath will look like either:
>                 #   '//path/to/property/text()'
>                 #   {'url': '//a[@id="Location"]/@href', 'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
>                 if isinstance(parse_xpath, dict):  # if True, then this place_property is in another URL
>                     url = sel.xpath(parse_xpath['url_elem']).extract()
>                     yield Request(url, callback=self.get_url_property, meta={
>                         'loader': bl, 'parse_xpath': parse_xpath, 'place_property': place_property
>                     })
>                 else:  # process normally
>                     bl.add_xpath(place_property, parse_xpath)
>             yield bl.load_item()
>
>         def get_url_property(self, response):
>             loader = response.meta['loader']
>             parse_xpath = response.meta['parse_xpath']
>             place_property = response.meta['place_property']
>             sel = Selector(response)
>             loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
>             return loader
>
> --
> Joey "JoeLinux" Espinosa
>
> <http://therealjoelinux.blogspot.com/> <http://twitter.com/therealjoelinux> <http://about.me/joelinux>
>
> On Wednesday, March 5, 2014 8:41:12 AM UTC-5, Joey Espinosa wrote:
>>
>> Hey guys,
>>
>> Disclaimer: I'm new to this group, and fairly new to Scrapy as well (but certainly not Python).
>>
>> Here is the issue I'm having. In my Scrapy project, I point to a page and hopefully grab everything I need for the item.
>> However, some domains (I'm scraping a significant amount of separate domains) have certain item properties located in another page within the initial page (for example, "location" might only be found by clicking on the "Get Directions" link on the page). I can't seem to get those "secondary" pages to work (the initial item goes through the pipelines without those properties, and I never see another item with those properties come through).
>>
>>     class SiteSpider(Spider):
>>         site_loader = SiteLoader
>>         ...
>>
>>         def parse(self, response):
>>             item = Place()
>>             sel = Selector(response)
>>             bl = self.site_loader(item=item, selector=sel)
>>             bl.add_value('domain', self.parent_domain)
>>             bl.add_value('origin', response.url)
>>             for place_property in item.fields:
>>                 parse_xpath = template.get(place_property)
>>                 # parse_xpath will look like either:
>>                 #   '//path/to/property/text()'
>>                 #   {'url': '//a[@id="Location"]/@href', 'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
>>                 if isinstance(parse_xpath, dict):  # if True, then this place_property is in another URL
>>                     url = sel.xpath(parse_xpath['url_elem']).extract()
>>                     yield Request(url, callback=self.get_url_property, meta={
>>                         'loader': bl, 'parse_xpath': parse_xpath, 'place_property': place_property
>>                     })
>>                 else:  # process normally
>>                     bl.add_xpath(event_property, template.get(event_property))
>>             yield bl.load_item()
>>
>>         def get_url_property(self, response):
>>             loader = response.meta['loader']
>>             parse_xpath = response.meta['parse_xpath']
>>             place_property = response.meta['place_property']
>>             sel = Selector(response)
>>             loader.add_value(place_property, sel.xpath(parse_xpath['xpath'])
>>             return loader
>>
>> Basically, the part I'm confused about is where you see "yield Request". I only put it there to illustrate where the problem lies; I know that this will cause the item to get processed without the properties found at that Request.
>> So in my example, if the Place().location property is located at another link on the page, I'd like to load that page and fill that property with the appropriate value. Even if a single loader can't do it, that's fine, maybe I can use loader.item or something. I don't know, that's pretty much where my Google trail has ended.
>>
>> Is what I want possible? I would prefer to keep the request asynchronous somehow, but if I really have to, making a synchronous request would suffice. Can someone kinda lead me in the right direction? I'd appreciate it. Thanks!
>>
>> --
>> Joey "JoeLinux" Espinosa
>>
>> <http://therealjoelinux.blogspot.com/> <http://twitter.com/therealjoelinux> <http://about.me/joelinux>
>
> --
> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
> To post to this group, send email to scrapy-users@googlegroups.com.
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/groups/opt_out.