You shouldn't return/yield the item until it's complete. In other words,
you should return the item from the "get_url_property" callback, not from
the main one. Each item must be yielded exactly once, and only after its
data has been fully populated.
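Roughly this flow, sketched in plain Python so the control flow is visible — note that PAGES, fetch(), and the tiny crawl() driver here are hypothetical stand-ins for Scrapy's downloader and engine, not real Scrapy API:

```python
# Sketch of the "defer the item" pattern: parse() does NOT yield the
# incomplete item; it passes the partially built item along via meta,
# and the final callback is the one place that yields it.

PAGES = {
    "http://example.com/place": {"name": "Some Place"},
    "http://example.com/directions": {"location": "123 Main St"},
}

def fetch(url):
    # Stand-in for Scrapy downloading a page (hypothetical helper).
    return PAGES[url]

def parse(response, meta):
    item = {"name": response["name"]}
    # Don't yield the incomplete item here; hand it to the secondary
    # request's callback through meta instead.
    yield ("request", "http://example.com/directions", get_url_property,
           {"item": item})

def get_url_property(response, meta):
    item = meta["item"]
    item["location"] = response["location"]
    # The item is yielded exactly once, only after it is fully populated.
    yield ("item", item)

def crawl(start_url, start_callback):
    # Tiny driver standing in for Scrapy's engine: follows yielded
    # requests and collects yielded items.
    queue = [(start_url, start_callback, {})]
    items = []
    while queue:
        url, callback, meta = queue.pop(0)
        for kind, payload, *rest in callback(fetch(url), meta):
            if kind == "request":
                queue.append((payload, rest[0], rest[1]))
            else:
                items.append(payload)
    return items

items = crawl("http://example.com/place", parse)
print(items)  # [{'name': 'Some Place', 'location': '123 Main St'}]
```

In real Scrapy the same shape applies: parse() yields a Request with the loader (or item) in meta, and get_url_property() fills the last field and yields the loaded item.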


On Wed, Mar 5, 2014 at 11:47 AM, Joey Espinosa <jlouis.espin...@gmail.com> wrote:

> HOLY TYPOS. Sorry. Revised:
>
> class SiteSpider(Spider):
>     site_loader = SiteLoader
>     ...
>     def parse(self, response):
>         item = Place()
>         sel = Selector(response)
>         bl = self.site_loader(item=item, selector=sel)
>         bl.add_value('domain', self.parent_domain)
>         bl.add_value('origin', response.url)
>         for place_property in item.fields:
>             parse_xpath = self.template.get(place_property)
>
>             # parse_xpath will look like either:
>             # '//path/to/property/text()'
>             # {'url': '//a[@id="Location"]/@href',
>             #  'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
>             if isinstance(parse_xpath, dict):    # this place_property is at another URL
>                 url = sel.xpath(parse_xpath['url']).extract()[0]
>                 yield Request(url, callback=self.get_url_property,
>                               meta={'loader': bl, 'parse_xpath': parse_xpath,
>                                     'place_property': place_property})
>             else:  # process normally
>                 bl.add_xpath(place_property, parse_xpath)
>
>         yield bl.load_item()
>
>     def get_url_property(self, response):
>         loader = response.meta['loader']
>         parse_xpath = response.meta['parse_xpath']
>         place_property = response.meta['place_property']
>         sel = Selector(response)
>         loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
>         return loader
>
>
> --
> Joey "JoeLinux" Espinosa
> <http://therealjoelinux.blogspot.com/> <http://twitter.com/therealjoelinux> <http://about.me/joelinux>
>
> On Wednesday, March 5, 2014 8:41:12 AM UTC-5, Joey Espinosa wrote:
>>
>> Hey guys,
>>
>> Disclaimer: I'm new to this group, and fairly new to Scrapy as well (but
>> certainly not Python).
>>
>> Here is the issue I'm having. In my Scrapy project, I point to a page and
>> hopefully grab everything I need for the item. However, some of the domains
>> I'm scraping (there are a significant number of separate ones) keep certain
>> item properties on a second page linked from the initial page (for example,
>> "location" might only be found by clicking the "Get Directions" link on the
>> page). I can't get those "secondary" pages to work: the initial item goes
>> through the pipelines without those properties, and I never see another
>> item with those properties come through.
>>
>> class SiteSpider(Spider):
>>     site_loader = SiteLoader
>>     ...
>>     def parse(self, response):
>>         item = Place()
>>         sel = Selector(response)
>>         bl = self.site_loader(item=item, selector=sel)
>>         bl.add_value('domain', self.parent_domain)
>>         bl.add_value('origin', response.url)
>>         for place_property in item.fields:
>>             parse_xpath = template.get(place_property)
>>
>>             # parse_xpath will look like either:
>>             # '//path/to/property/text()'
>> #  'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
>>             if isinstance(parse_xpath, dict):    # if True, then this place_property is in another URL
>>                 url = sel.xpath(parse_xpath['url_elem']).extract()
>>                 yield Request(url, callback=self.get_url_property,
>>                               meta={'loader': bl, 'parse_xpath': parse_xpath,
>>                                     'place_property': place_property})
>>             else:  # process normally
>>                 bl.add_xpath(event_property, template.get(event_property))
>>         yield bl.load_item()
>>
>>
>>     def get_url_property(self, response):
>>         loader = response.meta['loader']
>>         parse_xpath = response.meta['parse_xpath']
>>         place_property = response.meta['place_property']
>>         sel = Selector(response)
>>         loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
>>         return loader
>>
>>
>> Basically, the part I'm confused about is where you see "yield Request".
>> I only put it there to illustrate where the problem lies; I know that this
>> will cause the item to get processed without the properties found at that
>> Request. So in my example, if the Place().location property is located at
>> another link on the page, I'd like to load that page and fill that property
>> with the appropriate value. Even if a single loader can't do it, that's
>> fine, maybe I can use loader.item or something. I don't know, that's pretty
>> much where my Google trail has ended.
>>
>> Is what I want possible? I would prefer to keep the request asynchronous
>> somehow, but if I really have to, making a synchronous request would
>> suffice. Can someone kinda lead me in the right direction? I'd appreciate
>> it. Thanks!
>>
>> --
>> Joey "JoeLinux" Espinosa
>> <http://therealjoelinux.blogspot.com/> <http://twitter.com/therealjoelinux> <http://about.me/joelinux>
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to scrapy-users+unsubscr...@googlegroups.com.
> To post to this group, send email to scrapy-users@googlegroups.com.
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/groups/opt_out.
>
