2015-01-26 18:52:08+0100 [urls_grasping] INFO: Stored jsonlines feed (1 items) in: /var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl

right?
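Right -- the "Overridden settings" line in the log shows that scrapyd sets its own FEED_URI per job, so the items land in that .jl file under /var/lib/scrapyd/items/, not in output.json. If what you ultimately want is a single JSON array rather than jsonlines, a minimal conversion sketch (the function name `jl_to_json` and the paths are just placeholders, not anything scrapyd provides):

```python
import json

def jl_to_json(jl_path, json_path):
    """Read a jsonlines feed (one JSON object per line) and
    rewrite it as a single JSON array."""
    with open(jl_path) as src:
        items = [json.loads(line) for line in src if line.strip()]
    with open(json_path, "w") as dst:
        json.dump(items, dst, indent=2)
    return items

# e.g. (path is the one from your log):
# jl_to_json('/var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl',
#            'output.json')
```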
On Saturday, 31 January 2015 08:00:42 UTC-2, Marco Ippolito wrote:
>
> Any suggestions?
>
> Marco
>
> On Monday, 26 January 2015 19:00:14 UTC+1, Marco Ippolito wrote:
>>
>> Hi,
>>
>> I'm trying to export scrapyd's output to a json file.
>>
>> marco@pc:~/crawlscrape/urls_listing$ curl http://localhost:6800/listversions.json?project="urls_listing"
>> {"status": "ok", "versions": []}
>> marco@pc:~/crawlscrape/urls_listing$ scrapyd-deploy urls_listing -p urls_listing
>> Packing version 1422294714
>> Deploying to project "urls_listing" in http://localhost:6800/addversion.json
>> Server response (200):
>> {"status": "ok", "project": "urls_listing", "version": "1422294714", "spiders": 1}
>>
>> marco@pc:~/crawlscrape/urls_listing$ curl http://localhost:6800/schedule.json -d project=urls_listing -d spider=urls_grasping
>> {"status": "ok", "jobid": "0b4518bea58411e482bcc04a00090e80"}
>>
>> And this is the log file:
>>
>> 2015-01-26 18:52:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: urls_listing)
>> 2015-01-26 18:52:08+0100 [scrapy] INFO: Optional features available: ssl, http11
>> 2015-01-26 18:52:08+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'urls_listing.spiders', 'SPIDER_MODULES': ['urls_listing.spiders'], 'FEED_URI': '/var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl', 'LOG_FILE': '/var/log/scrapyd/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.log', 'BOT_NAME': 'urls_listing'}
>> 2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware,
>> DownloaderStats
>> 2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>> 2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled item pipelines:
>> 2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider opened
>> 2015-01-26 18:52:08+0100 [urls_grasping] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2015-01-26 18:52:08+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>> 2015-01-26 18:52:08+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>> 2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
>> 2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
>> 2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Scraped from <200 http://www.ilsole24ore.com/>
>> {'url': [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/ravvedimento/index.shtml',
>>  u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/ravvedimento/index.shtml',
>>  u'http://www.ilsole24ore.com/cultura.shtml',
>>  u'http://www.casa24.ilsole24ore.com/',
>>  u'http://www.moda24.ilsole24ore.com/',
>>  u'http://food24.ilsole24ore.com/',
>>  u'http://www.motori24.ilsole24ore.com/',
>>  u'http://job24.ilsole24ore.com/',
>>  u'http://stream24.ilsole24ore.com/',
>>  u'http://www.viaggi24.ilsole24ore.com/',
>>  u'http://www.salute24.ilsole24ore.com/',
>>  u'http://www.shopping24.ilsole24ore.com/',
>>  u'http://www.radio24.ilsole24ore.com/',
>>  u'http://america24.com/',
>>  u'http://meteo24.ilsole24ore.com/',
>>  u'https://24orecloud.ilsole24ore.com/',
>>  u'http://www.ilsole24ore.com/feed/agora/agora.shtml',
>>  u'http://www.formazione.ilsole24ore.com/',
>>  u'http://nova.ilsole24ore.com/',
>>  ......(omitted)
>>  u'http://websystem.ilsole24ore.com/',
>>  u'http://www.omniture.com']}
>> 2015-01-26 18:52:08+0100 [urls_grasping] INFO: Closing spider (finished)
>> 2015-01-26 18:52:08+0100 [urls_grasping] INFO: Stored jsonlines feed (1 items) in: /var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl
>> 2015-01-26 18:52:08+0100 [urls_grasping] INFO: Dumping Scrapy stats:
>> {'downloader/request_bytes': 434,
>>  'downloader/request_count': 2,
>>  'downloader/request_method_count/GET': 2,
>>  'downloader/response_bytes': 51709,
>>  'downloader/response_count': 2,
>>  'downloader/response_status_count/200': 1,
>>  'downloader/response_status_count/301': 1,
>>  'finish_reason': 'finished',
>>  'finish_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 820513),
>>  'item_scraped_count': 1,
>>  'log_count/DEBUG': 5,
>>  'log_count/INFO': 8,
>>  'response_received_count': 1,
>>  'scheduler/dequeued': 2,
>>  'scheduler/dequeued/memory': 2,
>>  'scheduler/enqueued': 2,
>>  'scheduler/enqueued/memory': 2,
>>  'start_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 612923)}
>> 2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider closed (finished)
>>
>> But there is no output.json:
>> marco@pc:~/crawlscrape/urls_listing$ ls -a
>> .  ..
>> build  project.egg-info  scrapy.cfg  setup.py  urls_listing
>>
>> In ~/crawlscrape/urls_listing/urls_listing, in items.py:
>>
>> class UrlsListingItem(scrapy.Item):
>>     # define the fields for your item here like:
>>     # url = scrapy.Field()
>>     # url = scrapy.Field(serializer=UrlsListingJsonExporter)
>>     url = scrapy.Field(serializer=serialize_url)
>>
>> In pipelines.py I put:
>>
>> class JsonExportPipeline(object):
>>     def __init__(self):
>>         dispatcher.connect(self.spider_opened, signals.spider_opened)
>>         dispatcher.connect(self.spider_closed, signals.spider_closed)
>>         self.files = {}
>>
>>     def spider_opened(self, spider):
>>         file = open('%s_items.json' % spider.name, 'w+b')
>>         self.files[spider] = file
>>         self.exporter = JsonLinesItemExporter(file)
>>         self.exporter.start_exporting()
>>
>>     def spider_closed(self, spider):
>>         self.exporter.finish_exporting()
>>         file = self.files.pop(spider)
>>         file.close()
>>
>>     def process_item(self, item, spider):
>>         self.exporter.export_item(item)
>>         return item
>>
>> In settings.py I put:
>>
>> BOT_NAME = 'urls_listing'
>>
>> SPIDER_MODULES = ['urls_listing.spiders']
>> NEWSPIDER_MODULE = 'urls_listing.spiders'
>>
>> FEED_URI = 'file://home/marco/crawlscrape/urls_listing/output.json'
>> # FEED_URI = 'output.json'
>> FEED_FORMAT = 'jsonlines'
>>
>> FEED_EXPORTERS = {
>>     'jsonlines': 'scrapy.contrib.exporter.JsonLinesItemExporter',
>> }
>>
>> What am I doing wrong?
>> Looking forward to your kind help.
>> Marco
>
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
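Two things in the settings.py quoted above look suspect. First, `file://home/...` parses "home" as a network host; an absolute local path in a file:// URI needs three slashes. Second, the custom `JsonExportPipeline` is never listed in ITEM_PIPELINES, and the log's "Enabled item pipelines:" line is empty, so it never runs. A settings.py sketch with both points addressed (assuming the class lives in urls_listing/pipelines.py; the priority value 300 is an arbitrary choice, and note that scrapyd will still override FEED_URI per job, which is why the pipeline is the more reliable route to a stable output path):

```python
# settings.py sketch -- not the original, a hedged correction of it
BOT_NAME = 'urls_listing'

SPIDER_MODULES = ['urls_listing.spiders']
NEWSPIDER_MODULE = 'urls_listing.spiders'

# three slashes: file:// + absolute path (only honored when FEED_URI
# is not overridden, e.g. when running with plain `scrapy crawl`)
FEED_URI = 'file:///home/marco/crawlscrape/urls_listing/output.json'
FEED_FORMAT = 'jsonlines'

# the custom pipeline only runs if it is registered here
ITEM_PIPELINES = {
    'urls_listing.pipelines.JsonExportPipeline': 300,
}
```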