> TypeError: Request url must be str or unicode, got HtmlXPathSelector:

Try:

    yield Request(url=time.extract()[0], callback=self.parse_article)

On Saturday, February 6, 2016 at 12:40:59 PM UTC-7, jaja...@gmail.com wrote:
>
> I seem to be having an issue with this spider. I am new to Python and to
> Scrapy, so I might be missing some fundamentals; any direction here would
> really help. Below is the code so far. I'm sure I have quite a few errors,
> and I keep trying to fix them, but I'm not getting too far. I am trying to
> get the spider to go into one page that has a table with the date and the
> article title, with the link embedded in the title, as you can see from
> the pic. Then, once it has the info from one row, it should go on to the
> next.
>
> <https://lh3.googleusercontent.com/-6iq4WuC8WqU/VrZBLRSxi0I/AAAAAAAAA8w/2XhPkMMUK8M/s1600/snipet.jpg>
>
> I figured that the best way to select the right sections was to use
> Scrapy's select() to dig deeper into the node, as the date is in its own
> HTML class and the URL and title are in another:
>
> <https://lh3.googleusercontent.com/-KoAf2zBZPRo/VrZChtH7HQI/AAAAAAAAA88/XUomt9qVNQU/s1600/html%2Bsnipet.jpg>
>
> So I used times = hxs.select('//td[@class="stime3"]') to get the date and
> sites = hxs.select('//td[@class="article"]') to get the title name and URL.
> from scrapy.spider import BaseSpider, Rule
> from scrapy.selector import HtmlXPathSelector
> from scrapy.selector import Selector
> from scrapy.contrib.linkextractors import LinkExtractor
> from dirbot.items import WebsiteLoader
> from scrapy.http import Request
> from scrapy.http import HtmlResponse
>
>
> class DindexSpider(BaseSpider):
>     name = "dindex"
>     USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
>                   "(KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36")
>     allowed_domains = ["newslookup.com"]
>     start_urls = [
>         "http://www.newslookup.com/Business/"]
>
>     rules = (
>         Rule(LinkExtractor(allow="newslookup.com/Business/"),
>              callback="parse", follow=True),
>     )
>
>     def parse(self, response):
>         hxs = HtmlXPathSelector(response)
>         self.log("Scraping: " + response.url)
>
>         times = hxs.select('//td[@class="stime3"]')
>         for time in times:
>             il = WebsiteLoader(response=response, selector=time)
>             il.add_xpath('publish_date', 'text()')
>             item = il.load_item()
>             yield Request(url=time, callback=self.parse_article)
>
>     def parse_article(self, response):
>         hxs = HtmlXPathSelector(response)
>         self.log("scraping: " + response.url)
>
>         sites = hxs.select('//td[@class="article"]')
>         for site in sites:
>             il = WebsiteLoader(response=response, selector=site)
>             il.add_xpath('name', 'a/text()')
>             il.add_xpath('url', 'a/@href')
>             item = il.load_item()
>             yield Request(url=times, callback=self.parse_item)
>
>     def parse_item(self, response):
>         item = response.meta['item']
>         yield il.load_item()
>
> Now, I may have the logic completely wrong, and I hope that someone can
> lead me in the right direction...
>
> One of the errors I get when I run it is:
>
> 2016-02-06 12:21:22 [scrapy] ERROR: Spider error processing <GET http://www.newslookup.com/Business/> (referer: None)
> Traceback (most recent call last):
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
>     yield next(it)
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
>     for x in result:
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
>     return (_set_referer(r) for r in result or ())
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
>     return (r for r in result or () if _filter(r))
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
>     return (r for r in result or () if _filter(r))
>   File "C:\virtualenvs\[TextIndexer]\Scripts\example\dindex\dirbot-mysql\dirbot\spiders\dindex.py", line 32, in parse
>     yield Request(url=time, callback=self.parse_article)
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 24, in __init__
>     self._set_url(url)
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 57, in _set_url
>     raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
> TypeError: Request url must be str or unicode, got HtmlXPathSelector:
> 2016-02-06 12:21:22 [scrapy] INFO: Closing spider (finished)
>
> I am really not sure what's wrong, other than that it's coming from
> line 32, where it yields the request.
>
> Any help or direction would be greatly appreciated.
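The traceback above comes down to handing Request() an XPath result object where it expects a URL string. Here is a stdlib-only sketch of the same distinction, using xml.etree in place of Scrapy's HtmlXPathSelector and made-up sample HTML shaped like the table in the question:

```python
# Stdlib stand-in for the pattern in the thread: an XPath query returns
# node objects, not strings, so Request(url=time, ...) passes Scrapy an
# HtmlXPathSelector where it needs str -- the reply's time.extract()[0]
# is what pulls the string out. The HTML below is invented for illustration.
import xml.etree.ElementTree as ET

html = """
<table>
  <tr>
    <td class="stime3">Feb 6, 2016</td>
    <td class="article"><a href="http://example.com/story">Story title</a></td>
  </tr>
</table>
"""

root = ET.fromstring(html)
link = root.find(".//td[@class='article']/a")

print(isinstance(link, str))   # False: a node object, like a selector
href = link.get("href")        # explicitly extract the string
print(href)
```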
>
> Thanks

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
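A follow-up note on the suggested fix: extract() gives back strings, but hrefs scraped from a listing page like this are often relative, so they may still need to be joined against the page URL before being passed to Request. Scrapy 1.0's response.urljoin() does essentially this stdlib call (the example paths below are made up):

```python
# Joining hypothetical relative hrefs against the page URL before
# building Requests; Scrapy's response.urljoin() wraps this behavior.
from urllib.parse import urljoin  # urlparse.urljoin on the Python 2.7 in the traceback

page = "http://www.newslookup.com/Business/"
print(urljoin(page, "/wires/example-story.html"))  # rooted at the domain
print(urljoin(page, "story.html"))                 # relative to /Business/
```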