I seem to be having an issue with this spider. I am new to Python and to Scrapy, so I might be missing some fundamentals, and any direction on this would really help. Below is the code so far. I'm sure I have quite a few errors, and I keep trying to fix them but I'm not getting far. I am trying to get the spider to go into a page that has a table with the date, the article title, and the link embedded in the title, as you can see from the pic. Then, once it has the info from one row, it should move on to the next.
<https://lh3.googleusercontent.com/-6iq4WuC8WqU/VrZBLRSxi0I/AAAAAAAAA8w/2XhPkMMUK8M/s1600/snipet.jpg>

I figured that the best way to select the right sections was to use Scrapy's select() to dig deeper into the node, since the date is in its own HTML class and the URL and title are in another:

<https://lh3.googleusercontent.com/-KoAf2zBZPRo/VrZChtH7HQI/AAAAAAAAA88/XUomt9qVNQU/s1600/html%2Bsnipet.jpg>

So I used times = hxs.select('//td[@class="stime3"]') to get the date, and sites = hxs.select('//td[@class="article"]') to get the title name and URL.

    from scrapy.spider import BaseSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors import LinkExtractor
    from dirbot.items import WebsiteLoader
    from scrapy.http import Request
    from scrapy.http import HtmlResponse

    class DindexSpider(BaseSpider):
        name = "dindex"
        USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"
        allowed_domains = ["newslookup.com"]
        start_urls = [
            "http://www.newslookup.com/Business/"]
        rules = (
            Rule(LinkExtractor(allow="newslookup.com/Business/"), callback="parse", follow=True),
        )

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            self.log("Scraping: " + response.url)
            times = hxs.select('//td[@class="stime3"]')
            for time in times:
                il = WebsiteLoader(response=response, selector=time)
                il.add_xpath('publish_date', 'text()')
                item = il.load_item()
                yield Request(url=time, callback=self.parse_article)

        def parse_article(self, response):
            hxs = HtmlXPathSelector(response)
            self.log("scraping: " + response.url)
            sites = hxs.select('//td[@class="article"]')
            for site in sites:
                il = WebsiteLoader(response=response, selector=site)
                il.add_xpath('name', 'a/text()')
                il.add_xpath('url', 'a/@href')
                item = il.load_item()
                yield Request(url=times, callback=self.parse_item)

        def parse_item(self, response):
            item = response.meta['item']
            yield il.load_item()

Now, I may
have the logic completely wrong, and I hope that someone can lead me in the right direction... One of the errors I get when I run it is:

    2016-02-06 12:21:22 [scrapy] ERROR: Spider error processing <GET http://www.newslookup.com/Business/> (referer: None)
    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
        yield next(it)
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
        for x in result:
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\virtualenvs\[TextIndexer]\Scripts\example\dindex\dirbot-mysql\dirbot\spiders\dindex.py", line 32, in parse
        yield Request(url=time, callback=self.parse_article)
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 24, in __init__
        self._set_url(url)
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 57, in _set_url
        raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    TypeError: Request url must be str or unicode, got HtmlXPathSelector:
    2016-02-06 12:21:22 [scrapy] INFO: Closing spider (finished)

I am really not sure what's wrong, other than that it's coming from line 32, where it yields the request. Any help or direction would be greatly appreciated.
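If I'm reading the traceback right, the problem is that Request(url=...) needs a plain string, but the loop passes the selector object (time) itself, so the href string would have to be extracted from the selector first (in Scrapy 1.0 that would be something like row.xpath('td[@class="article"]/a/@href').extract_first(), then response.urljoin(...) to make it absolute). As a sanity check of that extract-then-join idea, here is a standalone sketch using only the standard library, with made-up row values that mimic the markup in the screenshots:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical row mimicking the table markup from the screenshots.
ROW_HTML = (
    '<tr><td class="stime3">09:41 am</td>'
    '<td class="article"><a href="/wires/example-story.html">'
    'Example headline</a></td></tr>'
)

class RowParser(HTMLParser):
    """Collect the date text plus the article link and title from one row."""
    def __init__(self):
        super().__init__()
        self.cell = None   # class of the <td> we are currently inside
        self.href = None
        self.date = None
        self.title = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td':
            self.cell = attrs.get('class')
        elif tag == 'a' and self.cell == 'article':
            self.href = attrs.get('href')

    def handle_data(self, data):
        if self.cell == 'stime3':
            self.date = data.strip()
        elif self.cell == 'article' and self.href is not None:
            self.title = data.strip()

parser = RowParser()
parser.feed(ROW_HTML)

# The crucial step: the URL handed to Request must be a *string*,
# and a relative href needs to be joined against the page URL.
absolute_url = urljoin('http://www.newslookup.com/Business/', parser.href)
print(parser.date, parser.title, absolute_url)
```

The same walk-one-row-at-a-time structure is what I think the spider needs: select each <tr>, pull the stime3 text and the article link out of that row together, and only then yield a Request built from the extracted string.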
Thanks

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.