Hi Paul, thank you very much for your kind, prompt and helpful hint! It works.
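For the record, and for anyone who finds this thread later, here is my sole.py with your fix applied — a minimal sketch, identical to the code quoted below except that parse_item is renamed to parse (I also dropped the trailing SPIDER = SoleSpider() line, which isn't needed since the crawler finds the class through SPIDER_MODULES):

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from sole24ore.items import Sole24OreItem

    class SoleSpider(BaseSpider):
        name = 'sole'
        allowed_domains = ['sole24ore.com']
        start_urls = ['http://www.sole24ore.com/']

        # BaseSpider uses parse() as the default callback for the
        # requests it builds from start_urls; with the method named
        # parse_item, the inherited parse() ran instead and raised
        # NotImplementedError.
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            item = Sole24OreItem()
            # collect every href on the page that contains "http"
            item['url'] = hxs.select('//a[contains(@href, "http")]/@href').extract()
            return item

Running "scrapy crawl sole -o urls.json -t json" now writes the scraped item out to a JSON file via the feed exports.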
I learned one more thing today: "the devil is in the details".

Kind regards,
Marco

On Friday, 7 February 2014 17:13:07 UTC+1, Paul Tremberth wrote:
>
> ...meaning if you rename your "parse_item" method to "parse" you should
> be good
>
> On Friday, February 7, 2014 5:12:01 PM UTC+1, Paul Tremberth wrote:
>>
>> Hi Marco,
>>
>> when you use BaseSpider, you should define the parse callback to
>> process the responses for the URLs in start_urls.
>> Otherwise you get this NotImplementedError:
>> https://github.com/scrapy/scrapy/blob/master/scrapy/spider.py#L55
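>>
>> (For context, the default behind that link is essentially the stub
>> below; any response whose callback you don't define ends up there,
>> as the traceback in your log shows:)
>>
>>     def parse(self, response):
>>         # inherited by every spider that doesn't override it
>>         raise NotImplementedError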
>>
>> /Paul.
>>
>> On Friday, February 7, 2014 5:07:31 PM UTC+1, Marco Ippolito wrote:
>>>
>>> Hi everybody,
>>>
>>> through the scrapy shell it works:
>>>
>>> scrapy shell http://www.ilsole24ore.com/
>>> hxs.select('//a[contains(@href, "http")]/@href').extract()
>>> Out[1]:
>>> [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>>>  u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>>>  u'http://www.ilsole24ore.com/cultura.shtml',
>>>  u'http://www.casa24.ilsole24ore.com/',
>>>  u'http://www.moda24.ilsole24ore.com/',
>>>  u'http://food24.ilsole24ore.com/',
>>>  u'http://www.motori24.ilsole24ore.com/',
>>>  u'http://job24.ilsole24ore.com/',
>>>  u'http://stream24.ilsole24ore.com/',
>>>  u'http://www.viaggi24.ilsole24ore.com/',
>>>  u'http://www.salute24.ilsole24ore.com/',
>>>  u'http://www.shopping24.ilsole24ore.com/',
>>>  .....
>>>
>>> but it doesn't work outside the scrapy shell.
>>>
>>> items.py:
>>>
>>> from scrapy.item import Item, Field
>>>
>>> class Sole24OreItem(Item):
>>>     url = Field()
>>>
>>> sole.py:
>>>
>>> from scrapy.spider import BaseSpider
>>> from scrapy.selector import HtmlXPathSelector
>>> from sole24ore.items import Sole24OreItem
>>>
>>> class SoleSpider(BaseSpider):
>>>     name = 'sole'
>>>     allowed_domains = ['sole24ore.com']
>>>     start_urls = ['http://www.sole24ore.com/']
>>>
>>>     def parse_item(self, response):
>>>         hxs = HtmlXPathSelector(response)
>>>         item = Sole24OreItem()
>>>         url = hxs.select('//a[contains(@href, "http")]/@href').extract()
>>>         item['url'] = url
>>>         return item
>>>
>>> SPIDER = SoleSpider()
>>>
>>> sole24ore]$ scrapy crawl sole
>>> 2014-02-07 16:04:41+0000 [scrapy] INFO: Scrapy 0.18.4 started (bot: sole24ore)
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Optional features available: ssl, http11, boto
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'sole24ore.spiders', 'SPIDER_MODULES': ['sole24ore.spiders'], 'BOT_NAME': 'sole24ore'}
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled item pipelines:
>>> 2014-02-07 16:04:41+0000 [sole] INFO: Spider opened
>>> 2014-02-07 16:04:41+0000 [sole] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
>>> 2014-02-07 16:04:41+0000 [sole] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
>>> 2014-02-07 16:04:41+0000 [sole] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
>>> 2014-02-07 16:04:41+0000 [sole] ERROR: Spider error processing <GET http://www.ilsole24ore.com/>
>>>     Traceback (most recent call last):
>>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
>>>         self.runUntilCurrent()
>>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
>>>         call.func(*call.args, **call.kw)
>>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
>>>         self._startRunCallbacks(result)
>>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
>>>         self._runCallbacks()
>>>     --- <exception caught here> ---
>>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
>>>         current.result = callback(current.result, *args, **kw)
>>>       File "/usr/lib/pymodules/python2.7/scrapy/spider.py", line 57, in parse
>>>         raise NotImplementedError
>>>     exceptions.NotImplementedError:
>>>
>>> 2014-02-07 16:04:41+0000 [sole] INFO: Closing spider (finished)
>>> 2014-02-07 16:04:41+0000 [sole] INFO: Dumping Scrapy stats:
>>>     {'downloader/request_bytes': 448,
>>>      'downloader/request_count': 2,
>>>      'downloader/request_method_count/GET': 2,
>>>      'downloader/response_bytes': 47635,
>>>      'downloader/response_count': 2,
>>>      'downloader/response_status_count/200': 1,
>>>      'downloader/response_status_count/301': 1,
>>>      'finish_reason': 'finished',
>>>      'finish_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 585750),
>>>      'log_count/DEBUG': 8,
>>>      'log_count/ERROR': 1,
>>>      'log_count/INFO': 3,
>>>      'response_received_count': 1,
>>>      'scheduler/dequeued': 2,
>>>      'scheduler/dequeued/memory': 2,
>>>      'scheduler/enqueued': 2,
>>>      'scheduler/enqueued/memory': 2,
>>>      'spider_exceptions/NotImplementedError': 1,
>>>      'start_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 240417)}
>>> 2014-02-07 16:04:41+0000 [sole] INFO: Spider closed (finished)
>>>
>>> Any hints to help me?
>>>
>>> Thank you very much.
>>> Kind regards.
>>> Marco
