Hi Marco,

It looks like you didn't specify the correct regexp in your rules.
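Two things in the quoted spider are worth checking. First, `allow=r'Items/'` is a leftover from the Scrapy tutorial and won't match anything on that site. Second, the log shows the start URL redirecting to www.ilsole24ore.com while `allowed_domains` is `['sole24ore.com']`, so OffsiteMiddleware drops every extracted link. A quick sketch with plain `re` (the sample URLs are illustrative, not taken from the real site, and the host regex is only an approximation of what OffsiteMiddleware builds):

```python
import re

# The rule only follows links whose URL matches r'Items/' -- a leftover
# from the Scrapy tutorial. No plausible URL on www.ilsole24ore.com
# contains "Items/", so the link extractor schedules nothing after the
# start page, which is why the spider closes after one crawled page.
sample_urls = [
    "http://www.ilsole24ore.com/notizie/economia/2014-02-05/articolo.shtml",
    "http://www.ilsole24ore.com/finanza-e-mercati/",
]
tutorial_pattern = re.compile(r"Items/")
assert not any(tutorial_pattern.search(u) for u in sample_urls)

# Second problem, visible in the log: the start URL redirects to
# www.ilsole24ore.com, but allowed_domains is ['sole24ore.com'].
# OffsiteMiddleware builds roughly this host regex from allowed_domains,
# and www.ilsole24ore.com does not match it, so every extracted link
# gets filtered as off-site.
host_regex = re.compile(r"^(.*\.)?sole24ore\.com$")
print(bool(host_regex.match("www.sole24ore.com")))    # True  -> allowed
print(bool(host_regex.match("www.ilsole24ore.com")))  # False -> filtered
```

Setting `allowed_domains = ['ilsole24ore.com']`, starting from http://www.ilsole24ore.com/, and using an empty `allow=()` (which `SgmlLinkExtractor` treats as match-everything) should let the crawl proceed.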
On Wednesday, 5 February 2014 at 13:53:10 UTC+2, Marco Ippolito wrote:
>
> Hi everybody,
>
> I'm new to Scrapy and I would like to create a scraper that returns a list
> of URLs (just the URLs) in JSON format (to be exported to Redis later),
> all belonging to the same domain (e.g. www.ilsole24ore.com).
>
> items.py:
>
>     from scrapy.item import Item, Field
>
>     class Sole24OreItem(Item):
>         url = Field()
>         pass
>
> in spiders, sole.py:
>
>     from scrapy.selector import HtmlXPathSelector
>     from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>     from scrapy.contrib.spiders import CrawlSpider, Rule
>     from sole24ore.items import Sole24OreItem
>     from scrapy.contrib.exporter import JsonLinesItemExporter
>
>     class SoleSpider(CrawlSpider):
>         name = 'sole'
>         allowed_domains = ['sole24ore.com']
>         start_urls = ['http://www.sole24ore.com/']
>
>         rules = (
>             Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item',
>                  follow=True),
>         )
>
>         def parse_item(self, response):
>             l = XPathItemLoader(item=Website(), response=response)
>             hxs = HtmlXPathSelector(response)
>             sites = hxs.select('//ul[@class="directory-url"]/li')
>             items = []
>             for site in sites:
>                 item = Website()
>                 item['url'] = site.select('a/@href').extract()
>                 items.append(item)
>             return items
>
> Running Scrapy:
>
>     sole24ore]$ scrapy crawl sole
>     2014-02-05 11:50:30+0000 [scrapy] INFO: Scrapy 0.18.4 started (bot: sole24ore)
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Optional features available: ssl, http11, boto
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'sole24ore.spiders', 'SPIDER_MODULES': ['sole24ore.spiders'], 'BOT_NAME': 'sole24ore'}
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware,
>     MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Enabled item pipelines:
>     2014-02-05 11:50:30+0000 [sole] INFO: Spider opened
>     2014-02-05 11:50:30+0000 [sole] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
>     2014-02-05 11:50:30+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
>     2014-02-05 11:50:31+0000 [sole] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
>     2014-02-05 11:50:31+0000 [sole] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
>     2014-02-05 11:50:31+0000 [sole] INFO: Closing spider (finished)
>     2014-02-05 11:50:31+0000 [sole] INFO: Dumping Scrapy stats:
>     {'downloader/request_bytes': 448,
>      'downloader/request_count': 2,
>      'downloader/request_method_count/GET': 2,
>      'downloader/response_bytes': 47616,
>      'downloader/response_count': 2,
>      'downloader/response_status_count/200': 1,
>      'downloader/response_status_count/301': 1,
>      'finish_reason': 'finished',
>      'finish_time': datetime.datetime(2014, 2, 5, 11, 50, 31, 481605),
>      'log_count/DEBUG': 8,
>      'log_count/INFO': 3,
>      'response_received_count': 1,
>      'scheduler/dequeued': 2,
>      'scheduler/dequeued/memory': 2,
>      'scheduler/enqueued': 2,
>      'scheduler/enqueued/memory': 2,
>      'start_time': datetime.datetime(2014, 2, 5, 11, 50, 30, 930817)}
>     2014-02-05 11:50:31+0000 [sole] INFO: Spider closed (finished)
>
> Any hints to make it work?
>
> Thank you very much for your kind help.
> Kind regards.
> Marco

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
