I seem to be having an issue with this spider. I am new to Python and to Scrapy, so I might be missing some fundamentals, and any direction on this would really help. Below is the code so far. I'm sure I have quite a few errors, and I keep trying to fix them but I'm not getting far. I am trying to get the spider to go into a page that has a table with the date, the article title, and the link embedded in the title, as you can see from the pic. Then, once it has the info from one row, it should move on to the next.
<https://lh3.googleusercontent.com/-6iq4WuC8WqU/VrZBLRSxi0I/AAAAAAAAA8w/2XhPkMMUK8M/s1600/snipet.jpg>

I figured that the best way to select the right sections was to use Scrapy's select() to dig deeper into the node, since the date is in its own HTML class and the URL and title are in another:

<https://lh3.googleusercontent.com/-KoAf2zBZPRo/VrZChtH7HQI/AAAAAAAAA88/XUomt9qVNQU/s1600/html%2Bsnipet.jpg>

So I used times = hxs.select('//td[@class="stime3"]') to get the date, and sites = hxs.select('//td[@class="article"]') to get the title name and URL.

    from scrapy.spider import BaseSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors import LinkExtractor
    from dirbot.items import WebsiteLoader
    from scrapy.http import Request
    from scrapy.http import HtmlResponse

    class DindexSpider(BaseSpider):
        name = "dindex"
        USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"
        allowed_domains = ["newslookup.com"]
        start_urls = [
            "http://www.newslookup.com/Business/"]
        rules = (
            Rule(LinkExtractor(allow="newslookup.com/Business/"), callback="parse", follow=True),
        )

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            self.log("Scraping: " + response.url)
            times = hxs.select('//td[@class="stime3"]')
            for time in times:
                il = WebsiteLoader(response=response, selector=time)
                il.add_xpath('publish_date', 'text()')
                item = il.load_item()
                yield Request(url=time, callback=self.parse_article)

        def parse_article(self, response):
            hxs = HtmlXPathSelector(response)
            self.log("scraping: " + response.url)
            sites = hxs.select('//td[@class="article"]')
            for site in sites:
                il = WebsiteLoader(response=response, selector=site)
                il.add_xpath('name', 'a/text()')
                il.add_xpath('url', 'a/@href')
                item = il.load_item()
                yield Request(url=times, callback=self.parse_item)

        def parse_item(self, response):
            item = response.meta['item']
            yield il.load_item()

Now, I may
have the logic completely wrong, and I hope that someone can lead me in the right direction... One of the errors I get when I run it is:

    2016-02-06 12:21:22 [scrapy] ERROR: Spider error processing <GET http://www.newslookup.com/Business/> (referer: None)
    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
        yield next(it)
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
        for x in result:
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\virtualenvs\[TextIndexer]\Scripts\example\dindex\dirbot-mysql\dirbot\spiders\dindex.py", line 32, in parse
        yield Request(url=time, callback=self.parse_article)
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 24, in __init__
        self._set_url(url)
      File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 57, in _set_url
        raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    TypeError: Request url must be str or unicode, got HtmlXPathSelector:
    2016-02-06 12:21:22 [scrapy] INFO: Closing spider (finished)

I am really not sure what's wrong, other than that it's coming from line 32, where it yields the request. Any help or direction would be greatly appreciated.
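If I'm reading the traceback right, the problem is that Request(url=...) needs a plain string, but the loop passes the selector object (time) itself, so the href string would have to be extracted from the selector first (in Scrapy 1.0 that would be something like row.xpath('td[@class="article"]/a/@href').extract_first(), then response.urljoin(...) to make it absolute). As a sanity check of that extract-then-join idea, here is a standalone sketch using only the standard library, with made-up row values that mimic the markup in the screenshots:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical row mimicking the table markup from the screenshots.
ROW_HTML = (
    '<tr><td class="stime3">09:41 am</td>'
    '<td class="article"><a href="/wires/example-story.html">'
    'Example headline</a></td></tr>'
)

class RowParser(HTMLParser):
    """Collect the date text plus the article link and title from one row."""
    def __init__(self):
        super().__init__()
        self.cell = None   # class of the <td> we are currently inside
        self.href = None
        self.date = None
        self.title = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td':
            self.cell = attrs.get('class')
        elif tag == 'a' and self.cell == 'article':
            self.href = attrs.get('href')

    def handle_data(self, data):
        if self.cell == 'stime3':
            self.date = data.strip()
        elif self.cell == 'article' and self.href is not None:
            self.title = data.strip()

parser = RowParser()
parser.feed(ROW_HTML)

# The crucial step: the URL handed to Request must be a *string*,
# and a relative href needs to be joined against the page URL.
absolute_url = urljoin('http://www.newslookup.com/Business/', parser.href)
print(parser.date, parser.title, absolute_url)
```

The same walk-one-row-at-a-time structure is what I think the spider needs: select each <tr>, pull the stime3 text and the article link out of that row together, and only then yield a Request built from the extracted string.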
Thanks

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.