This looks promising; I'll let everyone know if it works. https://github.com/scrapinghub/scrapyjs
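In case it helps anyone else who lands on this thread, here is a rough sketch of how scrapyjs appears to hook into a project, going by a quick read of its README rather than a working setup. The SPLASH_URL setting, the scrapyjs.SplashMiddleware entry, and the meta['splash'] request options are taken from that README, so double-check them against the version you install; it also needs a separate Splash rendering service running (the project points to a Docker image for that), and the spider name below is just made up for the sketch.

# settings.py -- setting names as shown in the scrapyjs README (unverified here)
SPLASH_URL = 'http://localhost:8050'            # address of the running Splash service
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,           # reroutes tagged requests through Splash
}

# tm_spider.py -- ask Splash to run the page's JavaScript before parse() sees it
import scrapy

class TmSplashSpider(scrapy.Spider):
    name = "tm_splash"                          # hypothetical spider name for this sketch
    start_urls = ["http://www.***********.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',  # return the HTML after scripts have run
                    'args': {'wait': 1.0},      # give the dynamic content time to load
                }
            })

    def parse(self, response):
        # response.body should now hold the rendered HTML instead of the bare shell
        self.log("Rendered page is %d bytes" % len(response.body))

Compared with driving Selenium, this keeps everything inside Scrapy's normal request/response flow, which is the overhead trade-off discussed below.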
On Thursday, April 2, 2015 at 3:48:06 PM UTC-5, Troy Perkins wrote:
>
> I've been Googling around all day on how to scrape a JavaScript page with
> Scrapy. I think that's the issue. From what I've found, Scrapy doesn't
> support parsing JavaScript, and using Selenium is the only workaround.
> That's too much overhead for what I'm wanting to do... oh well. Hoping to
> find another solution. Thanks for your help, Travis; it was greatly
> appreciated.
>
> On Thursday, April 2, 2015 at 3:03:37 PM UTC-5, Travis Leleu wrote:
>>
>> Your recent debug output doesn't have that error, so you must have fixed it.
>>
>> The current error feels like it's either a JavaScript-loaded page, or
>> you're getting blocked from scraping by the server.
>>
>> Google around for how to scrape a JavaScript page with Scrapy, and for
>> using a proxy. Those guides will be your friend.
>>
>> On Thu, Apr 2, 2015 at 12:58 PM, Troy Perkins <[email protected]> wrote:
>>
>>> Hi Travis, thanks for the response. Not sure why it's not able to find
>>> it; it's there, see below:
>>>
>>> pawnbahnimac:spiders pawnbahn$ pwd
>>> /Users/pawnbahn/tm/tm/spiders
>>> pawnbahnimac:spiders pawnbahn$ ls
>>> Books  Resources  __init__.py  __init__.pyc  items.json  tm_spider.py  tm_spider.pyc
>>> pawnbahnimac:spiders pawnbahn$
>>>
>>> It only behaves like this on this site for some reason. Running the dmoz
>>> example works fine.
>>>
>>> pawnbahnimac:spiders pawnbahn$ scrapy crawl tm
>>> :0: UserWarning: You do not have a working installation of the
>>> service_identity module: 'No module named service_identity'. Please
>>> install it from <https://pypi.python.org/pypi/service_identity> and make
>>> sure all of its dependencies are satisfied. Without the service_identity
>>> module and a recent enough pyOpenSSL to support it, Twisted can perform
>>> only rudimentary TLS client hostname verification. Many valid
>>> certificate/hostname mappings may be rejected.
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Optional features available: ssl, http11
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled item pipelines:
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Spider opened
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>>> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>>> 2015-04-02 14:56:01-0500 [tm] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Closing spider (finished)
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Dumping Scrapy stats:
>>> {'downloader/request_bytes': 260,
>>>  'downloader/request_count': 1,
>>>  'downloader/request_method_count/GET': 1,
>>>  'downloader/response_bytes': 6234,
>>>  'downloader/response_count': 1,
>>>  'downloader/response_status_count/200': 1,
>>>  'finish_reason': 'finished',
>>>  'finish_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 861714),
>>>  'log_count/DEBUG': 3,
>>>  'log_count/INFO': 7,
>>>  'response_received_count': 1,
>>>  'scheduler/dequeued': 1,
>>>  'scheduler/dequeued/memory': 1,
>>>  'scheduler/enqueued': 1,
>>>  'scheduler/enqueued/memory': 1,
>>>  'start_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 494696)}
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Spider closed (finished)
>>>
>>> On Thursday, April 2, 2015 at 11:30:41 AM UTC-5, Travis Leleu wrote:
>>>>
>>>> Python can't find the file whose path is stored in filename, which is
>>>> used on line 13 of your spider. Read your Scrapy debug output to find
>>>> out more information.
>>>>
>>>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>>     with open(filename, 'wb') as f:
>>>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>>>
>>>> On Wed, Apr 1, 2015 at 10:38 PM, Troy Perkins <[email protected]> wrote:
>>>>
>>>>> Greetings all:
>>>>>
>>>>> I'm new to Scrapy and managed to get everything installed and working.
>>>>> However, my simple test project has proven not so simple, at least for me.
>>>>>
>>>>> I simply want to request the home page of t 1 c k e t m a s t e r
>>>>> d o t c o m, click the red Just Announced tab down the middle of the page,
>>>>> and -o the list of results out to an email address once a day via cron. I
>>>>> want to keep up with the announcements because their mailing lists simply
>>>>> don't send them soon enough.
>>>>>
>>>>> Here is my starting spider, which I've tested with other sites, and it
>>>>> works fine. I believe the error is due to it being a JavaScript-rendered
>>>>> site. I've used Firebug to look for clues, but I'm too new at this, and
>>>>> at JavaScript, to make sense of it. I'm hoping someone would be willing
>>>>> to point this noob in a direction. I've also tried removing middleware
>>>>> in the settings.py file, with the same results.
>>>>>
>>>>> I've purposely masked out the site address; even though I don't mean any
>>>>> harm, I'm not quite sure of their ToS as of yet. I plan to poll once a
>>>>> day anyway, for personal use.
>>>>>
>>>>> import scrapy
>>>>>
>>>>> from tm.items import TmItem
>>>>>
>>>>> class TmSpider(scrapy.Spider):
>>>>>     name = "tm"
>>>>>     allowed_domains = ["www.************.com"]
>>>>>     start_urls = [
>>>>>         "http://www.***********.com"
>>>>>     ]
>>>>>
>>>>>     def parse(self, response):
>>>>>         filename = response.url.split("/")[-2]
>>>>>         with open(filename, 'wb') as f:
>>>>>             f.write(response.body)
>>>>>
>>>>> scrapy crawl tm results in the following:
>>>>>
>>>>> :0: UserWarning: You do not have a working installation of the
>>>>> service_identity module: 'No module named service_identity'. Please
>>>>> install it from <https://pypi.python.org/pypi/service_identity> and make
>>>>> sure all of its dependencies are satisfied. Without the service_identity
>>>>> module and a recent enough pyOpenSSL to support it, Twisted can perform
>>>>> only rudimentary TLS client hostname verification. Many valid
>>>>> certificate/hostname mappings may be rejected.
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl, http11
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
>>>>> 2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
>>>>> 2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>>>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>>>>> 2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET http://www.****************com> (referer: None)
>>>>> 2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET http://www.****************.com>
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
>>>>>     self.runUntilCurrent()
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>>>>>     call.func(*call.args, **call.kw)
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
>>>>>     self._startRunCallbacks(result)
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
>>>>>     self._runCallbacks()
>>>>> --- <exception caught here> ---
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
>>>>>     current.result = callback(current.result, *args, **kw)
>>>>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>>>     with open(filename, 'wb') as f:
>>>>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
>>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
>>>>> {'downloader/request_bytes': 219,
>>>>>  'downloader/request_count': 1,
>>>>>  'downloader/request_method_count/GET': 1,
>>>>>  'downloader/response_bytes': 73266,
>>>>>  'downloader/response_count': 1,
>>>>>  'downloader/response_status_count/200': 1,
>>>>>  'finish_reason': 'finished',
>>>>>  'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
>>>>>  'log_count/DEBUG': 3,
>>>>>  'log_count/ERROR': 1,
>>>>>  'log_count/INFO': 7,
>>>>>  'response_received_count': 1,
>>>>>  'scheduler/dequeued': 1,
>>>>>  'scheduler/dequeued/memory': 1,
>>>>>  'scheduler/enqueued': 1,
>>>>>  'scheduler/enqueued/memory': 1,
>>>>>  'spider_exceptions/IOError': 1,
>>>>>  'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
>>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
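A closing note on the IOError in the traceback above, since it's separate from the JavaScript question: filename = response.url.split("/")[-2] comes out as an empty string for a bare homepage URL, because "http://www.example.com".split("/") gives ['http:', '', 'www.example.com'] and index -2 is the '' between the two slashes of "http://", so open('') fails. Below is a small sketch of one way to derive a non-empty filename instead; example.com stands in for the masked site and this hasn't been run against it.

import scrapy
from urlparse import urlparse  # Python 2 stdlib (Scrapy 0.24 era); urllib.parse on Python 3

class TmSpider(scrapy.Spider):
    name = "tm"
    allowed_domains = ["www.example.com"]      # stand-in for the masked domain
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # The hostname is never empty, so this avoids the open('') failure that
        # an empty path segment causes for a URL with no trailing path.
        filename = urlparse(response.url).netloc + ".html"
        with open(filename, 'wb') as f:
            f.write(response.body)

That only fixes the file-saving step; the Just Announced listings still won't appear in response.body until the page's JavaScript is rendered by something like the scrapyjs setup sketched at the top of the thread, or by Selenium.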
