Thanks for your help. I had also posted the question to Stack Overflow, where it was pointed out that I needed to chop off the trailing / after the allowed domains. The answer in full is here:
http://stackoverflow.com/questions/29348425/scrapy-outputs-into-my-json-file
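In other words, the fix was roughly this one line; the rest of the spider (posted further down) stays the same. As far as I can tell, the trailing slash stopped the offsite filtering from ever matching the followed links, so nothing reached parse_items:

from scrapy.contrib.spiders import CrawlSpider

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    # "news24.com/" (with the trailing slash) never matches the site's
    # hostname, so every followed request gets dropped as off-site
    allowed_domains = ["news24.com"]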
Kind regards,
Grant

On Monday, 30 March 2015 17:41:17 UTC+2, Daniel Fockler wrote:

Yeah, so you're on the right track. With Scrapy, the idea is that your parse item function gets passed a page full of content and you can yield as many items as you want from that page. So you can prepare your item in a for loop and yield that item; your code will then continue the for loop and yield another item. You don't want to yield an array of items, because Scrapy processes each yielded item separately and then puts them in an array for you at the end. Ideally you would loop on an XPath for something like a table or container whose entries all have a similar structure; those entries end up being your scraped items, which you then yield.
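A minimal sketch of that pattern (the spider name, URL, item fields and XPaths here are just placeholders, not the actual News24 selectors):

import scrapy

class StoryItem(scrapy.Item):
    Headline = scrapy.Field()
    Link = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com/listing"]

    def parse(self, response):
        # loop over a repeating container and yield one item per entry;
        # Scrapy collects the yielded items into the output array itself
        for story in response.xpath('//div[@class="story"]'):
            item = StoryItem()
            item["Headline"] = story.xpath('.//h2/text()').extract()
            item["Link"] = story.xpath('.//a/@href').extract()
            yield item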
On Monday, March 30, 2015 at 6:44:27 AM UTC-7, Grant Basson wrote:

OK, so I've done a bit more digging and I think I have a bit more information. My new spider:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem

class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com/"]
    start_urls = (
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    )

    rules = (
        Rule(SgmlLinkExtractor(allow=("", )), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//*[@id="aspnetForm"]')
        items = []
        item = BasicItem()
        item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        items.append(item)
        return items

I am still getting the same problem, though I have noticed that there is a "[" for every time I try to run the spider. To try to figure out what the issue is, I have run the following command:

c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328

which gives me the following output:

2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328> (referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 282,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 145301,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)

>>> DEPTH LEVEL: 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday.'],
  'Date': [u'2015-03-28 07:30'],
  'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests  -----------------------------------------------------------------
[]

So I can see that the item is being scraped, but no usable item data is put into the JSON file. This is how I'm running Scrapy:

scrapy crawl basic_spider -o test.json

I've been looking at the last line (return items), as changing it to either yield or print gives me no items scraped in the parse.
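For reference, the yield version looks roughly like this (the same method and selectors as above, just yielding the item instead of building and returning a list):

    def parse_items(self, response):
        titles = Selector(response).xpath('//*[@id="aspnetForm"]')
        item = BasicItem()
        item['Headline'] = titles.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        # yield the single item instead of appending it to a list
        yield item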
Any help at all would be much appreciated!

On Friday, 27 March 2015 11:45:46 UTC+2, Grant Basson wrote:

Hi all,

As the subject suggests, I'm a complete noob at web scraping. I've done all the usual googling, gone through the tutorial in the documentation, even watched a few tutorials on YouTube, and have now come up against a wall.

What I'm trying to achieve: essentially I am looking to crawl news sites, looking for a particular search term (or terms), and return the link to the story, the headline, the first paragraph of the actual article and the date the article was published, and insert all of this into an MSSQL database. I've got it crawling a particular site, but I can't even seem to get any output to look for search terms.

What I've got so far:

#--------import the required classes-----

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from news24.items import News24Item


class News24SpiderSpider(CrawlSpider):
    name = 'news24_spider'
    allowed_domains = ['news24.com']
    start_urls = ['http://www.news24.com/']

    #-------news24.com doesn't seem to have many stories attached to it directly,
    #-------so I haven't defined the "allow" parameter for the rule

    rules = (
        Rule(SgmlLinkExtractor(allow=("news24.com/", )), callback="parse_items", follow=True),
    )

    #-------the below I've copied from
    #-------http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VRUgBHkcQqg
    #-------and changed appropriately

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        headlines = hxs.xpath("/@html")
        items = []
        for headlines in headlines:
            item = News24Item()
            item["Headline"] = response.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = response.xpath('//*[@id="spnDate"]/text()').extract()
            item["Link"] = headlines.select("a/@href").extract()
            items.append(item)
        return(items)

    #-----end spider

What I get when I run the spider (scrapy crawl news24_spider -o test.json) shows that it is indeed recursively scraping pages to a depth of 2 (set in settings for testing purposes) and finding pages that SHOULD meet the XPath requirements set out above. When I open test.json, however, all I get is "[[[[[[".

Any help is appreciated.

Kind regards,
Grant
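The search-term filtering asked about in this original post never gets a concrete example in the thread, so here is a rough sketch of one way it could look. The spider name, item class and SEARCH_TERMS list are placeholders (not from the thread), the selectors assume the same News24 page structure discussed above, and link following is left out for brevity:

import scrapy

# placeholder search terms, not anything agreed in this thread
SEARCH_TERMS = ["food poisoning", "school"]

class NewsSearchItem(scrapy.Item):
    Link = scrapy.Field()
    Headline = scrapy.Field()
    Article = scrapy.Field()
    Date = scrapy.Field()

class NewsSearchSpider(scrapy.Spider):
    name = "news_search"
    allowed_domains = ["news24.com"]
    start_urls = ["http://www.news24.com/"]

    def parse(self, response):
        # flatten the article body text and keep the page only if a term matches
        article_text = " ".join(
            response.xpath('//*[@id="article-body"]//text()').extract()).lower()
        if not any(term in article_text for term in SEARCH_TERMS):
            return
        item = NewsSearchItem()
        item["Link"] = response.url
        item["Headline"] = response.xpath('//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = response.xpath('//*[@id="spnDate"]/text()').extract()
        yield item

In practice this check would sit inside the parse_items callback of the CrawlSpider versions above, so that every followed article page is filtered the same way, and the spider would be run as before with something like scrapy crawl news_search -o test.json.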
