Hi Marco, when you subclass BaseSpider you need to define a `parse` callback to process the responses for the URLs in start_urls; otherwise you get this NotImplementedError: https://github.com/scrapy/scrapy/blob/master/scrapy/spider.py#L55
In your sole.py the method is named `parse_item`, which Scrapy never calls for start_urls requests, so the default `parse` raises. Renaming it to `parse` should fix it.
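To see why the error happens, here is a minimal runnable sketch of the dispatch behaviour. Note the BaseSpider below is a simplified stand-in to illustrate the mechanism, not the real Scrapy class (the real one lives in scrapy/spider.py):

```python
# Simplified stand-in for scrapy.spider.BaseSpider, just to illustrate
# why a missing parse() override raises NotImplementedError.
class BaseSpider(object):
    def parse(self, response):
        # This mirrors what Scrapy's BaseSpider.parse does: the framework
        # routes every start_urls response here unless you override it.
        raise NotImplementedError


class BrokenSpider(BaseSpider):
    # parse_item is never called by the framework for start_urls,
    # so the inherited parse() runs and raises.
    def parse_item(self, response):
        return {'url': response}


class FixedSpider(BaseSpider):
    # Overriding parse() itself is the fix.
    def parse(self, response):
        return {'url': response}


try:
    BrokenSpider().parse('http://www.ilsole24ore.com/')
except NotImplementedError:
    print('BrokenSpider: NotImplementedError')

print(FixedSpider().parse('http://www.ilsole24ore.com/'))
# prints {'url': 'http://www.ilsole24ore.com/'}
```

So in your sole.py, keeping the method body exactly as it is and only renaming `def parse_item(self, response):` to `def parse(self, response):` should be enough.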
/Paul.

On Friday, February 7, 2014 5:07:31 PM UTC+1, Marco Ippolito wrote:

> Hi everybody,
>
> through the Scrapy shell:
>
>     scrapy shell http://www.ilsole24ore.com/
>     hxs.select('//a[contains(@href, "http")]/@href').extract()
>     Out[1]:
>     [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>      u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>      u'http://www.ilsole24ore.com/cultura.shtml',
>      u'http://www.casa24.ilsole24ore.com/',
>      u'http://www.moda24.ilsole24ore.com/',
>      u'http://food24.ilsole24ore.com/',
>      u'http://www.motori24.ilsole24ore.com/',
>      u'http://job24.ilsole24ore.com/',
>      u'http://stream24.ilsole24ore.com/',
>      u'http://www.viaggi24.ilsole24ore.com/',
>      u'http://www.salute24.ilsole24ore.com/',
>      u'http://www.shopping24.ilsole24ore.com/',
>      .....
>
> but it doesn't work outside the Scrapy shell.
>
> items.py:
>
>     from scrapy.item import Item, Field
>
>     class Sole24OreItem(Item):
>         url = Field()
>
> sole.py:
>
>     from scrapy.spider import BaseSpider
>     from scrapy.selector import HtmlXPathSelector
>     from sole24ore.items import Sole24OreItem
>
>     class SoleSpider(BaseSpider):
>         name = 'sole'
>         allowed_domains = ['sole24ore.com']
>         start_urls = ['http://www.sole24ore.com/']
>
>         def parse_item(self, response):
>             hxs = HtmlXPathSelector(response)
>             item = Sole24OreItem()
>             url = hxs.select('//a[contains(@href, "http")]/@href').extract()
>             item['url'] = url
>             return item
>
>     SPIDER = SoleSpider()
>
> sole24ore]$ scrapy crawl sole
> 2014-02-07 16:04:41+0000 [scrapy] INFO: Scrapy 0.18.4 started (bot: sole24ore)
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Optional features available: ssl, http11, boto
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'sole24ore.spiders', 'SPIDER_MODULES': ['sole24ore.spiders'], 'BOT_NAME': 'sole24ore'}
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled item pipelines:
> 2014-02-07 16:04:41+0000 [sole] INFO: Spider opened
> 2014-02-07 16:04:41+0000 [sole] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
> 2014-02-07 16:04:41+0000 [sole] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
> 2014-02-07 16:04:41+0000 [sole] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
> 2014-02-07 16:04:41+0000 [sole] ERROR: Spider error processing <GET http://www.ilsole24ore.com/>
>     Traceback (most recent call last):
>       File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
>         self.runUntilCurrent()
>       File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
>         call.func(*call.args, **call.kw)
>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
>         self._startRunCallbacks(result)
>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
>         self._runCallbacks()
>     --- <exception caught here> ---
>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
>         current.result = callback(current.result, *args, **kw)
>       File "/usr/lib/pymodules/python2.7/scrapy/spider.py", line 57, in parse
>         raise NotImplementedError
>     exceptions.NotImplementedError:
>
> 2014-02-07 16:04:41+0000 [sole] INFO: Closing spider (finished)
> 2014-02-07 16:04:41+0000 [sole] INFO: Dumping Scrapy stats:
>     {'downloader/request_bytes': 448,
>      'downloader/request_count': 2,
>      'downloader/request_method_count/GET': 2,
>      'downloader/response_bytes': 47635,
>      'downloader/response_count': 2,
>      'downloader/response_status_count/200': 1,
>      'downloader/response_status_count/301': 1,
>      'finish_reason': 'finished',
>      'finish_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 585750),
>      'log_count/DEBUG': 8,
>      'log_count/ERROR': 1,
>      'log_count/INFO': 3,
>      'response_received_count': 1,
>      'scheduler/dequeued': 2,
>      'scheduler/dequeued/memory': 2,
>      'scheduler/enqueued': 2,
>      'scheduler/enqueued/memory': 2,
>      'spider_exceptions/NotImplementedError': 1,
>      'start_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 240417)}
> 2014-02-07 16:04:41+0000 [sole] INFO: Spider closed (finished)
>
> Any hints to help me?
>
> Thank you very much.
> Kind regards.
> Marco
