Greetings all:
I'm new to Scrapy and managed to get everything installed and working.
However, my simple test project has proven not so simple, at least for me.
I simply want to request the home page of t 1 c k e t m a s t e r d o
t c o m, click the red Just Announced tab down the middle of the page,
and export the resulting list (via -o) to an email address once a day via
cron. I want to keep up with the announcements because their mailing lists
simply don't send them soon enough.
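For reference, the kind of crontab entry I have in mind is roughly the
following. The feed filename, the subject line, and the address are just
placeholders I made up, and I'm assuming a mailx-style mail command is
available:

# daily at 7am: start fresh (I gather -o appends rather than overwrites),
# crawl, then mail the exported feed to myself
0 7 * * * cd /Users/pawnbahn/tm && rm -f items.json && scrapy crawl tm -o items.json && mail -s "Just Announced" [email protected] < items.json

(I realize cron's PATH may not include scrapy, so I may need to spell out
the full path to it.)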
Here is my starting spider, which I've tested with other sites, and it
works fine there. I believe the error is due to the site being rendered
with JavaScript. I've used Firebug to look for clues, but I'm too new at
this (and at JavaScript) to make sense of what I'm seeing. I'm hoping
someone would be willing to point this noob in the right direction. I've
also tried removing middleware in the settings.py file, with the same
results.
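One check I think I can do (if I'm reading the docs right) is to fetch the
page in scrapy shell and see whether the tab text appears in the raw HTML
at all; if it doesn't, that would seem to confirm it's injected by
JavaScript:

$ scrapy shell "http://www.***********.com"
>>> 'Just Announced' in response.body   # False would mean it isn't in the initial HTML
>>> response.xpath('//a[contains(., "Just Announced")]').extract()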
I've purposely masked out the site address; although I don't mean any
harm, I'm not quite sure of their ToS as of yet. I plan to poll only once
a day anyway, for personal use.
import scrapy

from tm.items import TmItem


class TmSpider(scrapy.Spider):
    name = "tm"
    allowed_domains = ["www.************.com"]
    start_urls = [
        "http://www.***********.com"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Running scrapy crawl tm results in the following:
:0: UserWarning: You do not have a working installation of the
service_identity module: 'No module named service_identity'. Please
install it from <https://pypi.python.org/pypi/service_identity> and make
sure all of its dependencies are satisfied. Without the service_identity
module and a recent enough pyOpenSSL to support it, Twisted can perform
only rudimentary TLS client hostname verification. Many valid
certificate/hostname mappings may be rejected.
2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl,
http11
2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings:
{'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'],
'BOT_NAME': 'tm'}
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats,
TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware,
HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware,
UrlLengthMiddleware, DepthMiddleware
2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on
127.0.0.1:6023
2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on
127.0.0.1:6080
2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET
http://www.****************com> (referer: None)
2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET
http://www.****************.com>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py",
line 1201, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py",
line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
line 383, in callback
self._startRunCallbacks(result)
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
with open(filename, 'wb') as f:
exceptions.IOError: [Errno 2] No such file or directory: ''
2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 73266,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/IOError': 1,
'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
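P.S. Re-reading the traceback, I wonder if at least part of my problem is
simpler than JavaScript: for the bare homepage URL,
response.url.split("/") gives ['http:', '', 'www.************.com'], so
[-2] is the empty string and open('') fails with exactly that IOError. A
minimal fix on my end (the fallback filename is just something I made up)
would be:

    def parse(self, response):
        # the homepage URL has no path segment, so [-2] is '';
        # fall back to a fixed name in that case
        filename = response.url.split("/")[-2] or 'homepage.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

That should at least save the raw HTML, though I assume the Just Announced
list still won't be in it if it's rendered client-side.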