OK, so I've done a bit more digging and I think I have a bit more
information. My new spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from basic.items import BasicItem


class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com/"]
    start_urls = (
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    )

    rules = (
        Rule(SgmlLinkExtractor(allow=("", )),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//*[@id="aspnetForm"]')
        items = []
        item = BasicItem()
        item['Headline'] = titles.xpath(
            '//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath(
            '//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        items.append(item)
        return items
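For reference, my items.py just defines the three fields the spider fills in, more or less like this:

# basic/items.py
import scrapy

class BasicItem(scrapy.Item):
    Headline = scrapy.Field()
    Article = scrapy.Field()
    Date = scrapy.Field()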
I am still getting the same problem, though I've noticed that there is a
"[" in the output for every time I try to run the spider. To try to figure
out what the issue is, I have run the following command:
c:\Scrapy Spiders\basic>scrapy parse --spider=basic_spider -c parse_items -d 2 -v http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328
which gives me the following output:
2015-03-30 15:28:21+0200 [scrapy] INFO: Scrapy 0.24.5 started (bot: basic)
2015-03-30 15:28:21+0200 [scrapy] INFO: Optional features available: ssl, http11
2015-03-30 15:28:21+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'SPIDER_MODULES': ['basic.spiders'], 'DEPTH_LIMIT': 1, 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'basic'}
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-30 15:28:21+0200 [scrapy] INFO: Enabled item pipelines:
2015-03-30 15:28:21+0200 [basic_spider] INFO: Spider opened
2015-03-30 15:28:21+0200 [basic_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-30 15:28:21+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-30 15:28:22+0200 [basic_spider] DEBUG: Crawled (200) <GET http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328> (referer: None)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Closing spider (finished)
2015-03-30 15:28:22+0200 [basic_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 145301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 30, 13, 28, 22, 177000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 30, 13, 28, 21, 878000)}
2015-03-30 15:28:22+0200 [basic_spider] INFO: Spider closed (finished)
>>> DEPTH LEVEL: 1 <<<
# Scraped Items
------------------------------------------------------------
[{'Article': [u'Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday.'],
  'Date': [u'2015-03-28 07:30'],
  'Headline': [u'56 children hospitalised for food poisoning']}]
# Requests
-----------------------------------------------------------------
[]
So I can see that the item is being scraped, but no usable item data ends
up in the JSON file. This is how I'm running Scrapy:

scrapy crawl basic_spider -o test.json
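Based on the item shown in the parse output above, I'd expect test.json to come out containing something along these lines:

[{"Headline": ["56 children hospitalised for food poisoning"],
  "Article": ["Johannesburg - Fifty-six children were taken to\nPietermaritzburg hospitals after showing signs of food poisoning while at\nschool, KwaZulu-Natal emergency services said on Friday."],
  "Date": ["2015-03-28 07:30"]}]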
I've been looking at the last line (return items), since changing it to
either yield or print gives me no items scraped in the parse output.
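Roughly, this is what the yield version looked like (rewritten from memory, so treat it as a sketch rather than exactly what I ran):

    def parse_items(self, response):
        # same selectors as above; the only change is yielding the item
        # instead of returning a list of items
        titles = Selector(response).xpath('//*[@id="aspnetForm"]')
        item = BasicItem()
        item['Headline'] = titles.xpath(
            '//*[@id="article_special"]//h1/text()').extract()
        item["Article"] = titles.xpath(
            '//*[@id="article-body"]/p[1]/text()').extract()
        item["Date"] = titles.xpath('//*[@id="spnDate"]/text()').extract()
        yield item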
Any help at all would be much appreciated!
On Friday, 27 March 2015 11:45:46 UTC+2, Grant Basson wrote:
> Hi all,
>
> As the subject suggests, I'm a complete noob at web scraping. I've done
> all the usual googling, gone through the tutorial in the documentation,
> even watched a few tutorials on YouTube, and have now come up against a
> wall.
>
> What I'm trying to achieve:
>
> Essentially, I am looking to crawl news sites for a particular search term
> (or terms) and return the link to the story, the headline, the first
> paragraph of the actual article, and the date the article was published,
> and then insert this into an MSSQL database. I've got it crawling a
> particular site, but I can't even seem to get any output to look for the
> search terms in.
>
> What I've got so far:
>
> #--------import the required classes-----
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.selector import HtmlXPathSelector
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from news24.items import News24Item
>
>
> class News24SpiderSpider(CrawlSpider):
>     name = 'news24_spider'
>     allowed_domains = ['news24.com']
>     start_urls = ['http://www.news24.com/']
>
>     # ------- news24.com doesn't seem to have many stories attached to it
>     # ------- directly, so I haven't defined the "allow" parameter for the
>     # ------- rule
>     rules = (
>         Rule(SgmlLinkExtractor(allow=("news24.com/", )),
>              callback="parse_items", follow=True),
>     )
>
>     # ------- the below I've copied from
>     # ------- http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VRUgBHkcQqg
>     # ------- and changed appropriately
>     def parse_items(self, response):
>         hxs = HtmlXPathSelector(response)
>         headlines = hxs.xpath("/@html")
>         items = []
>         for headlines in headlines:
>             item = News24Item()
>             item["Headline"] = response.xpath(
>                 '//*[@id="article_special"]//h1/text()').extract()
>             item["Article"] = response.xpath(
>                 '//*[@id="article-body"]/p[1]/text()').extract()
>             item["Date"] = response.xpath(
>                 '//*[@id="spnDate"]/text()').extract()
>             item["Link"] = headlines.select("a/@href").extract()
>             items.append(item)
>         return items
>
>
> #-----end spider
>
> What I get when I run the spider (scrapy crawl news24_spider -o test.json)
> shows that it is indeed recursively scraping pages to a depth of 2 (set in
> settings for testing purposes) and finding pages that SHOULD meet the
> XPath requirements set out above. When I open test.json, however, all I
> get is "[[[[[[".
>
> Any help is appreciated.
>
> Kind regards,
> Grant
>