> TypeError: Request url must be str or unicode, got HtmlXPathSelector:

Try:

    yield Request(url=time.extract()[0], callback=self.parse_article)

On Saturday, February 6, 2016 at 12:40:59 PM UTC-7, jaja...@gmail.com wrote:
>
> I seem to be having an issue with this spider. I am new to Python and to
> Scrapy, so I might be missing some fundamentals; any direction here would
> really help. Below is the code so far. I'm sure I have quite a few errors,
> and I keep trying to fix them, but I'm not getting too far. I am trying to
> get the spider to go into one page that has a table with the date and the
> article title, with the link embedded in the title, as you can see from
> the pic. Then, once it has the info from one row, it should go on to the
> next.
>
> <https://lh3.googleusercontent.com/-6iq4WuC8WqU/VrZBLRSxi0I/AAAAAAAAA8w/2XhPkMMUK8M/s1600/snipet.jpg>
>
> I figured that the best way to select the right sections was to use
> Scrapy's select() to dig deeper into the node, as the date is in its own
> HTML class and the URL and title are in another:
>
> <https://lh3.googleusercontent.com/-KoAf2zBZPRo/VrZChtH7HQI/AAAAAAAAA88/XUomt9qVNQU/s1600/html%2Bsnipet.jpg>
>
> So I used times = hxs.select('//td[@class="stime3"]') to get the date and
> sites = hxs.select('//td[@class="article"]') to get the title name and URL.
> from scrapy.spider import BaseSpider, Rule
> from scrapy.selector import HtmlXPathSelector
> from scrapy.selector import Selector
> from scrapy.contrib.linkextractors import LinkExtractor
> from dirbot.items import WebsiteLoader
> from scrapy.http import Request
> from scrapy.http import HtmlResponse
>
>
> class DindexSpider(BaseSpider):
>     name = "dindex"
>     USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
>                   "(KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36")
>     allowed_domains = ["newslookup.com"]
>     start_urls = [
>         "http://www.newslookup.com/Business/"]
>
>     rules = (
>         Rule(LinkExtractor(allow="newslookup.com/Business/"),
>              callback="parse", follow=True),
>     )
>
>     def parse(self, response):
>         hxs = HtmlXPathSelector(response)
>         self.log("Scraping: " + response.url)
>
>         times = hxs.select('//td[@class="stime3"]')
>         for time in times:
>             il = WebsiteLoader(response=response, selector=time)
>             il.add_xpath('publish_date', 'text()')
>             item = il.load_item()
>             yield Request(url=time, callback=self.parse_article)
>
>     def parse_article(self, response):
>         hxs = HtmlXPathSelector(response)
>         self.log("scraping: " + response.url)
>
>         sites = hxs.select('//td[@class="article"]')
>         for site in sites:
>             il = WebsiteLoader(response=response, selector=site)
>             il.add_xpath('name', 'a/text()')
>             il.add_xpath('url', 'a/@href')
>             item = il.load_item()
>             yield Request(url=times, callback=self.parse_item)
>
>     def parse_item(self, response):
>         item = response.meta['item']
>         yield il.load_item()
>
> Now, I may have the logic completely wrong, and I hope that someone can
> lead me in the right direction...
>
> One of the errors I get when I run it is:
>
> 2016-02-06 12:21:22 [scrapy] ERROR: Spider error processing <GET http://www.newslookup.com/Business/> (referer: None)
> Traceback (most recent call last):
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
>     yield next(it)
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
>     for x in result:
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
>     return (_set_referer(r) for r in result or ())
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
>     return (r for r in result or () if _filter(r))
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
>     return (r for r in result or () if _filter(r))
>   File "C:\virtualenvs\[TextIndexer]\Scripts\example\dindex\dirbot-mysql\dirbot\spiders\dindex.py", line 32, in parse
>     yield Request(url=time, callback=self.parse_article)
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 24, in __init__
>     self._set_url(url)
>   File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 57, in _set_url
>     raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
> TypeError: Request url must be str or unicode, got HtmlXPathSelector:
> 2016-02-06 12:21:22 [scrapy] INFO: Closing spider (finished)
>
> I am really not sure what's wrong, other than that it's coming from
> line 32, where it yields the request.
>
> Any help or direction would be greatly appreciated.
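The traceback above comes down to handing Request() an XPath result object where it expects a URL string. Here is a stdlib-only sketch of the same distinction, using xml.etree in place of Scrapy's HtmlXPathSelector and made-up sample HTML shaped like the table in the question:

```python
# Stdlib stand-in for the pattern in the thread: an XPath query returns
# node objects, not strings, so Request(url=time, ...) passes Scrapy an
# HtmlXPathSelector where it needs str -- the reply's time.extract()[0]
# is what pulls the string out. The HTML below is invented for illustration.
import xml.etree.ElementTree as ET

html = """
<table>
  <tr>
    <td class="stime3">Feb 6, 2016</td>
    <td class="article"><a href="http://example.com/story">Story title</a></td>
  </tr>
</table>
"""

root = ET.fromstring(html)
link = root.find(".//td[@class='article']/a")

print(isinstance(link, str))   # False: a node object, like a selector
href = link.get("href")        # explicitly extract the string
print(href)
```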
>
> Thanks

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
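A follow-up note on the suggested fix: extract() gives back strings, but hrefs scraped from a listing page like this are often relative, so they may still need to be joined against the page URL before being passed to Request. Scrapy 1.0's response.urljoin() does essentially this stdlib call (the example paths below are made up):

```python
# Joining hypothetical relative hrefs against the page URL before
# building Requests; Scrapy's response.urljoin() wraps this behavior.
from urllib.parse import urljoin  # urlparse.urljoin on the Python 2.7 in the traceback

page = "http://www.newslookup.com/Business/"
print(urljoin(page, "/wires/example-story.html"))  # rooted at the domain
print(urljoin(page, "story.html"))                 # relative to /Business/
```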