Thanks for your help. I had also posted the question to Stack Overflow, where it was pointed out that I needed to chop off the trailing / after the allowed domains. The answer in full is here:
http://stackoverflow.com/questions/29348425/scrapy-outputs-into-my-json-file
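In other words, the fix was roughly this one line; the rest of the spider (posted further down) stays the same. As far as I can tell, the trailing slash stopped the offsite filtering from ever matching the followed links, so nothing reached parse_items:

from scrapy.contrib.spiders import CrawlSpider

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    # "news24.com/" (with the trailing slash) never matches the site's
    # hostname, so every followed request gets dropped as off-site
    allowed_domains = ["news24.com"]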
Kind regards,
Grant

On Monday, 30 March 2015 17:41:17 UTC+2, Daniel Fockler wrote:

Yeah, so you're on the right track. With Scrapy, the idea is that your parse item function gets passed a page full of content and you can yield as many items as you want from that page. So you can prepare your item in a for loop and yield that item; your code will then continue the for loop and yield another item. You don't want to yield an array of items, because Scrapy processes each yielded item separately and then puts them in an array for you at the end. Ideally you would loop on an XPath for something like a table or container whose entries all have a similar structure; those entries end up being your scraped items, which you then yield.
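A minimal sketch of that pattern (the spider name, URL, item fields and XPaths here are just placeholders, not the actual News24 selectors):

import scrapy

class StoryItem(scrapy.Item):
    Headline = scrapy.Field()
    Link = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com/listing"]

    def parse(self, response):
        # loop over a repeating container and yield one item per entry;
        # Scrapy collects the yielded items into the output array itself
        for story in response.xpath('//div[@class="story"]'):
            item = StoryItem()
            item["Headline"] = story.xpath('.//h2/text()').extract()
            item["Link"] = story.xpath('.//a/@href').extract()
            yield item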
On Monday, March 30, 2015 at 6:44:27 AM UTC-7, Grant Basson wrote:

OK, so I've done a bit more digging and I think I have a bit more information. My new spider:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com/"]
    start_urls = (
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    )

    rules = (
        Rule(SgmlLinkExtractor(allow=("", )), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//*[@id="aspnetForm"]')
        items = []
        item = BasicItem()
        item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        items.append(item)
        return items

I am still getting the same problem, though I have noticed that there is a "[" for every time I try to run the spider. To try to figure out what the issue is, I have run the following command:

c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328

which gives me the following output:

2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328> (referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 282,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 145301,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)

>>> DEPTH LEVEL: 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday.'],
  'Date': [u'2015-03-28 07:30'],
  'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests  -----------------------------------------------------------------
[]

So I can see that the item is being scraped, but no usable item data is put into the JSON file. This is how I'm running Scrapy:

scrapy crawl basic_spider -o test.json

I've been looking at the last line (return items), as changing it to either yield or print gives me no items scraped in the parse.
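For reference, the yield version looks roughly like this (the same method and selectors as above, just yielding the item instead of building and returning a list):

    def parse_items(self, response):
        titles = Selector(response).xpath('//*[@id="aspnetForm"]')
        item = BasicItem()
        item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        # yield the single item instead of appending it to a list
        yield item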
Any help at all would be much appreciated!

On Friday, 27 March 2015 11:45:46 UTC+2, Grant Basson wrote:

Hi all,

As the subject suggests, I'm a complete noob at web scraping. I've done all the usual googling, gone through the tutorial in the documentation, even watched a few tutorials on YouTube, and have now come up against a wall.

What I'm trying to achieve: essentially I am looking to crawl news sites, looking for a particular search term (or terms), and return the link to the story, the headline, the first paragraph of the actual article and the date the article was published, and insert all of this into an MSSQL database. I've got it crawling a particular site, but I can't even seem to get any output to look for search terms.

What I've got so far:

#--------import the required classes-----

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from news24.items import News24Item


class News24SpiderSpider(CrawlSpider):
    name = 'news24_spider'
    allowed_domains = ['news24.com']
    start_urls = ['http://www.news24.com/']

    #-------news24.com doesn't seem to have many stories attached to it directly,
    #-------so I haven't defined the "allow" parameter for the rule

    rules = (
        Rule(SgmlLinkExtractor(allow=("news24.com/", )), callback="parse_items", follow=True),
    )

    #-------the below I've copied from
    #-------http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VRUgBHkcQqg
    #-------and changed appropriately

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        headlines = hxs.xpath("/@html")
        items = []
        for headlines in headlines:
            item = News24Item()
            item["Headline"] = response.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = response.xpath('//*[@id="spnDate"]/text()').extract()
            item["Link"] = headlines.select("a/@href").extract()
            items.append(item)
        return(items)

    #-----end spider

What I get when I run the spider (scrapy crawl news24_spider -o test.json) shows that it is indeed recursively scraping pages to a depth of 2 (set in settings for testing purposes) and finding pages that SHOULD meet the XPath requirements set out above. When I open test.json, however, all I get is "[[[[[[".

Any help is appreciated.

Kind regards,
Grant
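The search-term filtering asked about in this original post never gets a concrete example in the thread, so here is a rough sketch of one way it could look. The spider name, item class and SEARCH_TERMS list are placeholders (not from the thread), the selectors assume the same News24 page structure discussed above, and link following is left out for brevity:

import scrapy

# placeholder search terms, not anything agreed in this thread
SEARCH_TERMS = ["food poisoning", "school"]

class NewsSearchItem(scrapy.Item):
    Link = scrapy.Field()
    Headline = scrapy.Field()
    Article = scrapy.Field()
    Date = scrapy.Field()

class NewsSearchSpider(scrapy.Spider):
    name = "news_search"
    allowed_domains = ["news24.com"]
    start_urls = ["http://www.news24.com/"]

    def parse(self, response):
        # flatten the article body text and keep the page only if a term matches
        article_text = " ".join(
            response.xpath('//*[@id="article-body"]//text()').extract()).lower()
        if not any(term in article_text for term in SEARCH_TERMS):
            return
        item = NewsSearchItem()
        item["Link"] = response.url
        item["Headline"] = response.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = response.xpath('//*[@id="spnDate"]/text()').extract()
        yield item

In practice this check would sit inside the parse_items callback of the CrawlSpider versions above, so that every followed article page is filtered the same way, and the spider would be run as before with something like scrapy crawl news_search -o test.json.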
