Hi Daniel,
actually I ran

/crawlscrape/urls_listing/urls_listing$ scrapy crawl urls_grasping -o items.json -t json

and a JSON file has been created:

/crawlscrape/urls_listing/urls_listing$ ls -lh items.json
-rw-rw-r-- 1 marco marco 93K gen 23 06:20 items.json

So it seems that some settings used by scrapyd-deploy have to be fine-tuned.
Do you know anyone I could ask?
Thank you very much for your kind help.
Marco

2015-01-22 20:16 GMT+01:00 Daniel Fockler <[email protected]>:
> Alright, so it looks like running your project using scrapyd-deploy is
> changing the output settings, so for some reason your item output files
> are going to
>
> /var/lib/scrapyd/items/urls_listing/urls_grasping/8d8e0dfaa20d11e48c91c04a00090e80.jl
>
> In your project you can try just running scrapyd without scrapyd-deploy,
> and that should allow you to scrape using the correct settings. I don't
> have a ton of experience using the deploy features with scrapyd-deploy,
> so I'm not sure I can help much with that.
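The .jl path above is scrapyd itself at work: when the items_dir option in scrapyd's configuration is non-empty, scrapyd overrides the project's feed URI for every scheduled job and writes the items there as JSON Lines, which is why the output never lands where settings.py says. A minimal sketch of the relevant part of /etc/scrapyd/scrapyd.conf (the file location and the /var/lib, /var/log paths are assumptions based on the Ubuntu/Debian packaging visible in the listings further down):

[scrapyd]
eggs_dir  = /var/lib/scrapyd/eggs
dbs_dir   = /var/lib/scrapyd/dbs
logs_dir  = /var/log/scrapyd
# A non-empty items_dir makes scrapyd redirect item output to
# <items_dir>/<project>/<spider>/<jobid>.jl; leave it empty if the
# project's own FEED_URI / FEED_FORMAT should be honoured instead.
items_dir =
http_port = 6800

After editing, restart the scrapyd service so the new value is picked up; alternatively, leave items_dir as it is and simply read the .jl files it produces.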
> On Thursday, January 22, 2015 at 12:16:27 AM UTC-8, Marco Ippolito wrote:
>>
>> Hi Daniel,
>> not seeing any logs, I decided to create a new project from scratch,
>> but I still have the same problems.
>> I attached the compressed (.tar) log files together with the
>> compressed (.tar) urls_listing project directory.
>> Feel free to have a try.
>>
>> Looking forward to your kind help.
>> Marco
>>
>> 2015-01-21 22:09 GMT+01:00 Daniel Fockler <[email protected]>:
>> > If you end up with an empty output.json, or a file that just has a '['
>> > character, that could mean that scrapy couldn't find any items from
>> > your spider. But if that is not the case, then there is another issue.
>> > Scrapyd should output logs for every spider that you run, in a logs
>> > directory.
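On the "not seeing any logs" point: with the packaged scrapyd defaults, each scheduled run gets its own log file named after the job id, grouped by project and spider, and the same files are served by scrapyd's web interface. A sketch of where to look (the /var/log/scrapyd location is an assumption about the Ubuntu/Debian package; the job id is the one returned by schedule.json or listed by listjobs.json):

# Job ids for the project (pending / running / finished).
curl "http://localhost:6800/listjobs.json?project=urls_listing"

# On disk, one log file per run (packaged default location assumed):
ls /var/log/scrapyd/urls_listing/urls_grasping/

# The same files should be browsable through scrapyd's web UI, e.g.
# http://localhost:6800/logs/urls_listing/urls_grasping/<jobid>.log

If listjobs.json keeps returning empty lists, as in the sole24ore session further down, then no job has been scheduled yet, so there is no log to find.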
>> > On Wednesday, January 21, 2015 at 11:41:49 AM UTC-8, Marco Ippolito wrote:
>> >>
>> >> Hi Daniel,
>> >> thanks again for helping.
>> >>
>> >> I tried with
>> >>
>> >> FEED_URI = 'file://home/marco/crawlscrape/sole24ore/output.json'
>> >> FEED_FORMAT = 'json'
>> >>
>> >> and with
>> >>
>> >> FEED_URI = 'output.json'
>> >> FEED_FORMAT = 'json'
>> >>
>> >> In both cases there is no output and no error message.
>> >>
>> >> Any hints?
>> >>
>> >> Marco
>> >>
>> >> 2015-01-21 20:28 GMT+01:00 Daniel Fockler <[email protected]>:
>> >> > You'll want to make sure in your settings.py that the feed format is
>> >> > set, like
>> >> >
>> >> > FEED_FORMAT = 'json'
>> >> >
>> >> > If it doesn't work after that, then just try changing the feed URI to
>> >> >
>> >> > FEED_URI = 'output.json'
>> >> >
>> >> > and scrapy will dump it in your project root.
>> >> >
>> >> > On Tuesday, January 20, 2015 at 11:00:50 PM UTC-8, Marco Ippolito wrote:
>> >> >>
>> >> >> Hi Daniel,
>> >> >> thank you very much for your kind help.
>> >> >>
>> >> >> After scheduling the spider run, an output is actually produced:
>> >> >>
>> >> >> Opening file
>> >> >> /var/lib/scrapyd/items/sole24ore/sole/89d644f8a13a11e4a2afc04a00090e80.jl
>> >> >> Read output!
>> >> >> This is my output:
>> >> >> {"url": ["http://m.bbc.co.uk", "http://www.bbc.com/news/", " .....
>> >> >>
>> >> >> But modifying the feed settings as
>> >> >>
>> >> >> BOT_NAME = 'sole24ore'
>> >> >>
>> >> >> SPIDER_MODULES = ['sole24ore.spiders']
>> >> >> NEWSPIDER_MODULE = 'sole24ore.spiders'
>> >> >>
>> >> >> FEED_URI = 'file://home/marco/crawlscrape/sole24ore/output.json'
>> >> >>
>> >> >> doesn't produce an output.json in /home/marco/crawlscrape/sole24ore.
>> >> >>
>> >> >> Am I missing some other steps?
>> >> >>
>> >> >> Marco
>> >> >>
>> >> >> 2015-01-20 18:45 GMT+01:00 Daniel Fockler <[email protected]>:
>> >> >> > For your first problem, you've started the scrapyd project but you
>> >> >> > need to schedule a spider run using the schedule.json command.
>> >> >> > Something like
>> >> >> >
>> >> >> > curl http://localhost:6800/schedule.json -d project=sole24ore -d
>> >> >> > spider=yourspidername
>> >> >> >
>> >> >> > For your second problem, your settings.py is misconfigured; your
>> >> >> > feed settings should be like
>> >> >> >
>> >> >> > FEED_URI = 'file://home/marco/crawlscrape/sole24ore/output.json'
>> >> >> > FEED_FORMAT = 'json'
>> >> >> >
>> >> >> > Hope that helps.
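For reference, scrapyd's schedule.json also accepts per-run Scrapy settings through the setting parameter, so the feed destination can be pushed at scheduling time instead of being baked into settings.py. A sketch using the names from this thread (spider "sole", path under /home/marco; whether such an override wins when scrapyd's items_dir is also configured is not verified here):

# Schedule a run and pass feed settings as job-level overrides.
curl http://localhost:6800/schedule.json \
     -d project=sole24ore \
     -d spider=sole \
     -d setting=FEED_FORMAT=json \
     -d setting=FEED_URI=file:///home/marco/crawlscrape/sole24ore/output.json

# The response contains a jobid; the run should then show up here.
curl "http://localhost:6800/listjobs.json?project=sole24ore"

Note the three slashes in file:///home/... : with only two, "home" is parsed as a host name rather than as part of the path, which may explain why the absolute-URI variants above never produced an output.json.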
>> >> >> > On Tuesday, January 20, 2015 at 4:23:04 AM UTC-8, Marco Ippolito wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >> I've got 2 situations to solve.
>> >> >> >>
>> >> >> >> It seems that everything is ok:
>> >> >> >>
>> >> >> >> (SCREEN) marco@pc:~/crawlscrape/sole24ore$ scrapyd-deploy sole24ore -p sole24ore
>> >> >> >> Packing version 1421755479
>> >> >> >> Deploying to project "sole24ore" in http://localhost:6800/addversion.json
>> >> >> >> Server response (200):
>> >> >> >> {"status": "ok", "project": "sole24ore", "version": "1421755479", "spiders": 1}
>> >> >> >>
>> >> >> >> marco@pc:/var/lib/scrapyd/dbs$ ls -lah
>> >> >> >> totale 12K
>> >> >> >> drwxr-xr-x 2 scrapy nogroup 4,0K gen 20 13:04 .
>> >> >> >> drwxr-xr-x 5 scrapy nogroup 4,0K gen 20 06:55 ..
>> >> >> >> -rw-r--r-- 1 root   root    2,0K gen 20 13:04 sole24ore.db
>> >> >> >>
>> >> >> >> marco@pc:/var/lib/scrapyd/eggs/sole24ore$ ls -lah
>> >> >> >> totale 16K
>> >> >> >> drwxr-xr-x 2 scrapy nogroup 4,0K gen 20 13:04 .
>> >> >> >> drwxr-xr-x 3 scrapy nogroup 4,0K gen 20 12:47 ..
>> >> >> >> -rw-r--r-- 1 scrapy nogroup 5,5K gen 20 13:04 1421755479.egg
>> >> >> >>
>> >> >> >> But nothing is executed:
>> >> >> >>
>> >> >> >> marco@pc:/var/lib/scrapyd/items/sole24ore/sole$ ls -a
>> >> >> >> .  ..
>> >> >> >>
>> >> >> >> [detached from 2515.pts-4.pc]
>> >> >> >> marco@pc:~/crawlscrape/sole24ore$ curl http://localhost:6800/listjobs.json?project=sole24ore
>> >> >> >> {"status": "ok", "running": [], "finished": [], "pending": []}
>> >> >> >>
>> >> >> >> The second aspect concerns how to save the output into a JSON file.
>> >> >> >> What is the correct form to put into settings.py?
>> >> >> >>
>> >> >> >> # Scrapy settings for sole24ore project
>> >> >> >> #
>> >> >> >> # For simplicity, this file contains only the most important settings by
>> >> >> >> # default. All the other settings are documented here:
>> >> >> >> #
>> >> >> >> # http://doc.scrapy.org/en/latest/topics/settings.html
>> >> >> >> #
>> >> >> >>
>> >> >> >> BOT_NAME = 'sole24ore'
>> >> >> >>
>> >> >> >> SPIDER_MODULES = ['sole24ore.spiders']
>> >> >> >> NEWSPIDER_MODULE = 'sole24ore.spiders'
>> >> >> >>
>> >> >> >> FEED_URI=file://home/marco/crawlscrape/sole24ore/output.json --set
>> >> >> >> FEED_FORMAT=json
>> >> >> >>
>> >> >> >> (SCREEN) marco@pc:~/crawlscrape/sole24ore$ scrapyd-deploy sole24ore -p sole24ore
>> >> >> >> Packing version 1421756389
>> >> >> >> Deploying to project "sole24ore" in http://localhost:6800/addversion.json
>> >> >> >> Server response (200):
>> >> >> >> {"status": "error", "message": "SyntaxError: invalid syntax"}
>> >> >> >>
>> >> >> >> # Crawl responsibly by identifying yourself (and your website) on the user-agent
>> >> >> >> #USER_AGENT = 'sole24ore (+http://www.yourdomain.com)'
>> >> >> >>
>> >> >> >> Looking forward to your kind help.
>> >> >> >> Kind regards.
>> >> >> >> Marco
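The "SyntaxError: invalid syntax" above is the project failing to import: the FEED_URI=file://... --set FEED_FORMAT=json line is command-line syntax pasted into a Python module. In settings.py the values have to be ordinary quoted Python strings (--set belongs on the scrapy command line, not in this file). A corrected sketch of that fragment, with the absolute path written as a three-slash file URI (the path itself is the one quoted above):

# sole24ore/settings.py
BOT_NAME = 'sole24ore'

SPIDER_MODULES = ['sole24ore.spiders']
NEWSPIDER_MODULE = 'sole24ore.spiders'

# Plain Python assignments with quoted string values.
FEED_URI = 'file:///home/marco/crawlscrape/sole24ore/output.json'
FEED_FORMAT = 'json'

With that in place, scrapyd-deploy should pack and upload the egg without the server-side SyntaxError; whether the file then actually appears at that path when the job runs under scrapyd still depends on the items_dir behaviour discussed at the top of the thread.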
