I've been Googling around all day on how to scrape a JavaScript page with Scrapy, and I think that's the issue. From what I've found, Scrapy doesn't execute JavaScript, and using Selenium is the only real workaround. That's too much overhead for what I'm wanting to do... oh well. Hoping to find another solution. Thanks for your help, Travis; it was greatly appreciated.
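
In case I do end up biting the bullet on Selenium, the rough sketch below is what I had in mind: let a real browser render the page, then hand the rendered HTML back to Scrapy's selectors. This is only a sketch based on my assumptions; the spider name, the ".just-announced" CSS selector, and the log-only output are placeholders I'd still have to confirm with Firebug, and it assumes selenium plus a local Firefox are installed.

import scrapy
from scrapy.selector import Selector
from selenium import webdriver


class TmSeleniumSpider(scrapy.Spider):
    # Placeholder name; selectors below are guesses, not confirmed against the real page.
    name = "tm_selenium"
    start_urls = ["http://www.***********.com"]

    def __init__(self, *args, **kwargs):
        super(TmSeleniumSpider, self).__init__(*args, **kwargs)
        # A real browser that executes the page's javascript for us.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Scrapy already fetched the raw page; load it again in the browser
        # so the javascript-built content actually exists in the DOM.
        self.driver.get(response.url)
        rendered = self.driver.page_source

        # Hand the rendered HTML back to Scrapy's selectors as usual.
        sel = Selector(text=rendered)
        # ".just-announced a::text" is a guess at the tab's markup.
        for title in sel.css(".just-announced a::text").extract():
            self.log("announced: %s" % title.strip())

    def closed(self, reason):
        # Shut the browser down when the spider finishes.
        self.driver.quit()

If that turns out to be too heavy, I'll look at whether the tab's data comes from some JSON endpoint I could request directly instead.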
On Thursday, April 2, 2015 at 3:03:37 PM UTC-5, Travis Leleu wrote:
>
> Your recent debug output doesn't have that error, so you must have fixed it.
>
> The current error feels like it's either a javascript-loaded page, or you're getting blocked from scraping by the server.
>
> Google around for how to scrape a javascript page with scrapy, and using a proxy. Those guides will be your friend.
>
> On Thu, Apr 2, 2015 at 12:58 PM, Troy Perkins <[email protected]> wrote:
>
>> Hi Travis, thanks for the response. Not sure why it's not able to find it; it's there, see below:
>>
>> pawnbahnimac:spiders pawnbahn$ pwd
>> /Users/pawnbahn/tm/tm/spiders
>> pawnbahnimac:spiders pawnbahn$ ls
>> Books  Resources  __init__.py  __init__.pyc  items.json  tm_spider.py  tm_spider.pyc
>> pawnbahnimac:spiders pawnbahn$
>>
>> It only behaves like this on this site for some reason. Running the dmoz example works fine.
>>
>> pawnbahnimac:spiders pawnbahn$ scrapy crawl tm
>> :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Optional features available: ssl, http11
>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled item pipelines:
>> 2015-04-02 14:56:01-0500 [tm] INFO: Spider opened
>> 2015-04-02 14:56:01-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>> 2015-04-02 14:56:01-0500 [tm] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
>> 2015-04-02 14:56:01-0500 [tm] INFO: Closing spider (finished)
>> 2015-04-02 14:56:01-0500 [tm] INFO: Dumping Scrapy stats:
>>     {'downloader/request_bytes': 260,
>>      'downloader/request_count': 1,
>>      'downloader/request_method_count/GET': 1,
>>      'downloader/response_bytes': 6234,
>>      'downloader/response_count': 1,
>>      'downloader/response_status_count/200': 1,
>>      'finish_reason': 'finished',
>>      'finish_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 861714),
>>      'log_count/DEBUG': 3,
>>      'log_count/INFO': 7,
>>      'response_received_count': 1,
>>      'scheduler/dequeued': 1,
>>      'scheduler/dequeued/memory': 1,
>>      'scheduler/enqueued': 1,
>>      'scheduler/enqueued/memory': 1,
>>      'start_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 494696)}
>> 2015-04-02 14:56:01-0500 [tm] INFO: Spider closed (finished)
>>
>> On Thursday, April 2, 2015 at 11:30:41 AM UTC-5, Travis Leleu wrote:
>>>
>>> Python can't find the file whose path is stored in filename, used on line 13 of your spider. Read your scrapy debug output to find out more information.
>>>
>>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>     with open(filename, 'wb') as f:
>>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>>
>>> On Wed, Apr 1, 2015 at 10:38 PM, Troy Perkins <[email protected]> wrote:
>>>
>>>> Greetings all:
>>>>
>>>> I'm new to scrapy and managed to get everything installed and working. However, my simple test project has proven not so simple, at least for me.
>>>>
>>>> I simply want to request the home page of t 1 c k e t m a s t e r d o t c o m, click the red Just Announced tab down the middle of the page, and -o the list of results out to an email address once a day via cron. I want to be able to keep up with the announcements because their mailing lists simply don't send them soon enough.
>>>>
>>>> Here is my starting spider, which I've tested with other sites, and it works fine. I believe the error is due to it being a javascript-rendered site. I've used Firebug to look for clues, but I'm too new at this, and at javascript, to understand what I'm seeing. I'm hoping someone would be willing to point this noob in a direction. I've also tried removing middleware in the settings.py file, with the same results.
>>>>
>>>> I've purposely masked out the site address; though I don't mean any harm, I'm not quite sure of their ToS as of yet. I plan to poll once a day anyway, for personal use.
>>>>
>>>> import scrapy
>>>>
>>>> from tm.items import TmItem
>>>>
>>>> class TmSpider(scrapy.Spider):
>>>>     name = "tm"
>>>>     allowed_domains = ["www.************.com"]
>>>>     start_urls = [
>>>>         "http://www.***********.com"
>>>>     ]
>>>>
>>>>     def parse(self, response):
>>>>         filename = response.url.split("/")[-2]
>>>>         with open(filename, 'wb') as f:
>>>>             f.write(response.body)
>>>>
>>>> scrapy crawl tm results in the following:
>>>>
>>>> :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl, http11
>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
>>>> 2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
>>>> 2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>>>> 2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET http://www.****************com> (referer: None)
>>>> 2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET http://www.****************.com>
>>>>     Traceback (most recent call last):
>>>>       File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
>>>>         self.runUntilCurrent()
>>>>       File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>>>>         call.func(*call.args, **call.kw)
>>>>       File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
>>>>         self._startRunCallbacks(result)
>>>>       File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
>>>>         self._runCallbacks()
>>>>     --- <exception caught here> ---
>>>>       File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
>>>>         current.result = callback(current.result, *args, **kw)
>>>>       File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>>         with open(filename, 'wb') as f:
>>>>     exceptions.IOError: [Errno 2] No such file or directory: ''
>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
>>>>     {'downloader/request_bytes': 219,
>>>>      'downloader/request_count': 1,
>>>>      'downloader/request_method_count/GET': 1,
>>>>      'downloader/response_bytes': 73266,
>>>>      'downloader/response_count': 1,
>>>>      'downloader/response_status_count/200': 1,
>>>>      'finish_reason': 'finished',
>>>>      'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
>>>>      'log_count/DEBUG': 3,
>>>>      'log_count/ERROR': 1,
>>>>      'log_count/INFO': 7,
>>>>      'response_received_count': 1,
>>>>      'scheduler/dequeued': 1,
>>>>      'scheduler/dequeued/memory': 1,
>>>>      'scheduler/enqueued': 1,
>>>>      'scheduler/enqueued/memory': 1,
>>>>      'spider_exceptions/IOError': 1,
>>>>      'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
