Python can't open the file whose path is stored in filename, used on line 13
of your spider.  For a bare URL like yours, response.url.split("/")[-2]
evaluates to an empty string, so open('') fails with IOError [Errno 2].  Read
your scrapy debug output to find out more information.

  File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
    with open(filename, 'wb') as f:
exceptions.IOError: [Errno 2] No such file or directory: ''
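
For reference, here's a minimal sketch of one way to derive a usable
filename (assuming Python 2, per your traceback paths; the 'homepage.html'
fallback name is my own invention):

import scrapy
from urlparse import urlparse  # Python 2 stdlib, matching your traceback

class TmSpider(scrapy.Spider):
    name = "tm"
    allowed_domains = ["www.************.com"]
    start_urls = [
        "http://www.***********.com"
    ]

    def parse(self, response):
        # For a bare homepage URL with no trailing path,
        # response.url.split("/")[-2] is '' -- the source of your IOError.
        # Build the name from the URL path instead, with a fallback.
        path = urlparse(response.url).path
        filename = path.strip('/').replace('/', '_') or 'homepage.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Note this only fixes the crash; if the Just Announced content really is
rendered by javascript, you'll still need something else to get at it.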

On Wed, Apr 1, 2015 at 10:38 PM, Troy Perkins <[email protected]>
wrote:

> Greetings all:
>
> I'm new to scrapy and managed to get everything installed and working.
> However, my simple test project has proven not so simple, at least for me.
>
> I simply want to request the home page of t 1 c k e t m a s t e r d o
> t c o m, click the red Just Announced tab down the middle of the page, and
> output (-o) the list of results to an email address once a day via cron.  I
> want to keep up with the announcements because their mailing lists simply
> don't send them soon enough.
>
> Here is my starting spider, which I've tested with other sites, and it
> works fine.  I believe the error is due to the site being rendered with
> javascript.  I've used Firebug to look for clues, but I'm too new at this,
> and at javascript, to make sense of them.  I'm hoping someone would be
> willing to point this noob in a direction.  I've also tried removing
> middleware in the settings.py file, with the same results.
>
> I've purposely masked out the site address; though I don't mean any harm,
> I'm not quite sure of their ToS as of yet.  I plan to poll only once a day
> anyway, for personal use.
>
> import scrapy
>
> from tm.items import TmItem
>
> class TmSpider(scrapy.Spider):
>     name = "tm"
>     allowed_domains = ["www.************.com"]
>     start_urls = [
>         "http://www.***********.com"
>     ]
>
>     def parse(self, response):
>         filename = response.url.split("/")[-2]
>         with open(filename, 'wb') as f:
>             f.write(response.body)
>
> scrapy crawl tm results in the following:
>
> :0: UserWarning: You do not have a working installation of the
> service_identity module: 'No module named service_identity'.  Please
> install it from <https://pypi.python.org/pypi/service_identity> and make
> sure all of its dependencies are satisfied.  Without the service_identity
> module and a recent enough pyOpenSSL to support it, Twisted can perform
> only rudimentary TLS client hostname verification.  Many valid
> certificate/hostname mappings may be rejected.
> 2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
> 2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl,
> http11
> 2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings:
> {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'],
> 'BOT_NAME': 'tm'}
> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats,
> TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares:
> HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
> RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware,
> HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware,
> ChunkedTransferMiddleware, DownloaderStats
> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares:
> HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware,
> UrlLengthMiddleware, DepthMiddleware
> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
> 2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
> 2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min),
> scraped 0 items (at 0 items/min)
> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on
> 127.0.0.1:6023
> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on
> 127.0.0.1:6080
> 2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET
> http://www.****************.com> (referer: None)
> 2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET
> http://www.****************.com>
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py",
> line 1201, in mainLoop
>     self.runUntilCurrent()
>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py",
> line 824, in runUntilCurrent
>     call.func(*call.args, **call.kw)
>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
> line 383, in callback
>     self._startRunCallbacks(result)
>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
> line 491, in _startRunCallbacks
>     self._runCallbacks()
> --- <exception caught here> ---
>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py",
> line 578, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>     with open(filename, 'wb') as f:
> exceptions.IOError: [Errno 2] No such file or directory: ''
> 2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
> 2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
> {'downloader/request_bytes': 219,
>  'downloader/request_count': 1,
>  'downloader/request_method_count/GET': 1,
>  'downloader/response_bytes': 73266,
>  'downloader/response_count': 1,
>  'downloader/response_status_count/200': 1,
>  'finish_reason': 'finished',
>  'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
>  'log_count/DEBUG': 3,
>  'log_count/ERROR': 1,
>  'log_count/INFO': 7,
>  'response_received_count': 1,
>  'scheduler/dequeued': 1,
>  'scheduler/dequeued/memory': 1,
>  'scheduler/enqueued': 1,
>  'scheduler/enqueued/memory': 1,
>  'spider_exceptions/IOError': 1,
>  'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
> 2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
>
