Greetings all:
I'm new to Scrapy and managed to get everything installed and working.
However, my simple test project has proven not so simple, at least for me.
I simply want to request the home page of t 1 c k e t m a s t e r d o
t c o m, click the red Just Announced tab down the middle of the page,
and export the resulting list (via -o) to an email address once a day via
cron. I want to keep up with the announcements because their mailing lists
simply don't send them soon enough.
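For reference, the kind of crontab entry I have in mind is roughly the
following. The feed filename, the subject line, and the address are just
placeholders I made up, and I'm assuming a mailx-style mail command is
available:

# daily at 7am: start fresh (I gather -o appends rather than overwrites),
# crawl, then mail the exported feed to myself
0 7 * * * cd /Users/pawnbahn/tm && rm -f items.json && scrapy crawl tm -o items.json && mail -s "Just Announced" [email protected] < items.json

(I realize cron's PATH may not include scrapy, so I may need to spell out
the full path to it.)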
Here is my starting spider, which I've tested with other sites, and it
works fine there. I believe the error is due to the site being rendered
with JavaScript. I've used Firebug to look for clues, but I'm too new at
this (and at JavaScript) to make sense of what I'm seeing. I'm hoping
someone would be willing to point this noob in the right direction. I've
also tried removing middleware in the settings.py file, with the same
results.
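One check I think I can do (if I'm reading the docs right) is to fetch the
page in scrapy shell and see whether the tab text appears in the raw HTML
at all; if it doesn't, that would seem to confirm it's injected by
JavaScript:

$ scrapy shell "http://www.***********.com"
>>> 'Just Announced' in response.body   # False would mean it isn't in the initial HTML
>>> response.xpath('//a[contains(., "Just Announced")]').extract()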
I've purposely masked out the site address; although I don't mean any
harm, I'm not quite sure of their ToS as of yet. I plan to poll only once
a day anyway, for personal use.
import scrapy

from tm.items import TmItem


class TmSpider(scrapy.Spider):
    name = "tm"
    allowed_domains = ["www.************.com"]
    start_urls = [
        "http://www.***********.com"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Running scrapy crawl tm results in the following:
:0: UserWarning: You do not have a working installation of the
service_identity module: 'No module named service_identity'. Please
install it from <https://pypi.python.org/pypi/service_identity> and make
sure all of its dependencies are satisfied. Without the service_identity
module and a recent enough pyOpenSSL to support it, Twisted can perform
only rudimentary TLS client hostname verification. Many valid
certificate/hostname mappings may be rejected.
2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl,
http11
2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings:
{'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'],
'BOT_NAME': 'tm'}
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats,
TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware,
HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware,
UrlLengthMiddleware, DepthMiddleware
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on
127.0.0.1:6023
2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on
127.0.0.1:6080
2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET
http://www.****************com> (referer: None)
2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET
http://www.****************.com>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py",
line 1201, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py",
line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
line 383, in callback
self._startRunCallbacks(result)
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
with open(filename, 'wb') as f:
exceptions.IOError: [Errno 2] No such file or directory: ''
2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 73266,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/IOError': 1,
'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
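P.S. Re-reading the traceback, I wonder if at least part of my problem is
simpler than JavaScript: for the bare homepage URL,
response.url.split("/") gives ['http:', '', 'www.************.com'], so
[-2] is the empty string and open('') fails with exactly that IOError. A
minimal fix on my end (the fallback filename is just something I made up)
would be:

    def parse(self, response):
        # the homepage URL has no path segment, so [-2] is '';
        # fall back to a fixed name in that case
        filename = response.url.split("/")[-2] or 'homepage.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

That should at least save the raw HTML, though I assume the Just Announced
list still won't be in it if it's rendered client-side.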