OK, so I've done a bit more digging and I think I have a bit more
information. My new spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem


class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com/"]
    start_urls = (
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    )

    rules = (
        Rule(SgmlLinkExtractor(allow=("", )),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//*[@id="aspnetForm"]')
        items = []
        item = BasicItem()
        item['Headline'] = titles.xpath(
            '//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath(
            '//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        items.append(item)
        return items
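For reference, my items.py just defines the three fields the spider fills in, more or less like this:

# basic/items.py
import scrapy

class BasicItem(scrapy.Item):
    Headline = scrapy.Field()
    Article = scrapy.Field()
    Date = scrapy.Field()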
I am still getting the same problem, though I've noticed that there is a
"[" in the output for every time I try to run the spider. To try to figure
out what the issue is, I have run the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328> (referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items
------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday.'],
  'Date': [u'2015-03-28 07:30'],
  'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests
-----------------------------------------------------------------
[]
So I can see that the item is being scraped, but no usable item data ends
up in the JSON file. This is how I'm running Scrapy:

scrapy crawl basic_spider -o test.json
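Based on the item shown in the parse output above, I'd expect test.json to come out containing something along these lines:

[{"Headline": ["56 children hospitalised for food poisoning"],
  "Article": ["Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday."],
  "Date": ["2015-03-28 07:30"]}]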
I've been looking at the last line (return items), since changing it to
either yield or print gives me no items scraped in the parse output.
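Roughly, this is what the yield version looked like (rewritten from memory, so treat it as a sketch rather than exactly what I ran):

    def parse_items(self, response):
        # same selectors as above; the only change is yielding the item
        # instead of returning a list of items
        titles = Selector(response).xpath('//*[@id="aspnetForm"]')
        item = BasicItem()
        item['Headline'] = titles.xpath(
            '//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath(
            '//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        yield item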
Any help at all would be much appreciated!
On Friday, 27 March 2015 11:45:46 UTC+2, Grant Basson wrote:
> Hi all,
>
> As the subject suggests, I'm a complete noob at web scraping. I've done
> all the usual googling, gone through the tutorial in the documentation,
> even watched a few tutorials on YouTube, and have now come up against a
> wall.
>
> What I'm trying to achieve:
>
> Essentially, I am looking to crawl news sites for a particular search term
> (or terms) and return the link to the story, the headline, the first
> paragraph of the actual article, and the date the article was published,
> and then insert this into an MSSQL database. I've got it crawling a
> particular site, but I can't even seem to get any output to look for the
> search terms in.
>
> What I've got so far:
>
> #--------import the required classes-----
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.selector import HtmlXPathSelector
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from news24.items import News24Item
>
>
> class News24SpiderSpider(CrawlSpider):
>     name = 'news24_spider'
>     allowed_domains = ['news24.com']
>     start_urls = ['http://www.news24.com/']
>
>     # ------- news24.com doesn't seem to have many stories attached to it
>     # ------- directly, so I haven't defined the "allow" parameter for the
>     # ------- rule
>     rules = (
>         Rule(SgmlLinkExtractor(allow=("news24.com/", )),
>              callback="parse_items", follow=True),
>     )
>
>     # ------- the below I've copied from
>     # ------- http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VRUgBHkcQqg
>     # ------- and changed appropriately
>     def parse_items(self, response):
>         hxs = HtmlXPathSelector(response)
>         headlines = hxs.xpath("/@html")
>         items = []
>         for headlines in headlines:
>             item = News24Item()
>             item["Headline"] = response.xpath(
>                 '//*[@id="article_special"]//h1/text()').extract()
>             item["Article"] = response.xpath(
>                 '//*[@id="article-body"]/p[1]/text()').extract()
>             item["Date"] = response.xpath(
>                 '//*[@id="spnDate"]/text()').extract()
>             item["Link"] = headlines.select("a/@href").extract()
>             items.append(item)
>         return items
>
>
> #-----end spider
>
> What I get when I run the spider (scrapy crawl news24_spider -o test.json)
> shows that it is indeed recursively scraping pages to a depth of 2 (set in
> settings for testing purposes) and finding pages that SHOULD meet the
> XPath requirements set out above. When I open test.json, however, all I
> get is "[[[[[[".
>
> Any help is appreciated.
>
> Kind regards,
> Grant
>