This looks promising; I'll let everyone know if it works. https://github.com/scrapinghub/scrapyjs
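In case it helps anyone else who lands on this thread, here is a rough sketch of how scrapyjs appears to hook into a project, going by a quick read of its README rather than a working setup. The SPLASH_URL setting, the scrapyjs.SplashMiddleware entry, and the meta['splash'] request options are taken from that README, so double-check them against the version you install; it also needs a separate Splash rendering service running (the project points to a Docker image for that), and the spider name below is just made up for the sketch.

# settings.py -- setting names as shown in the scrapyjs README (unverified here)
SPLASH_URL = 'http://localhost:8050'            # address of the running Splash service
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,           # reroutes tagged requests through Splash
}

# tm_spider.py -- ask Splash to run the page's JavaScript before parse() sees it
import scrapy

class TmSplashSpider(scrapy.Spider):
    name = "tm_splash"                          # hypothetical spider name for this sketch
    start_urls = ["http://www.***********.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',  # return the HTML after scripts have run
                    'args': {'wait': 1.0},      # give the dynamic content time to load
                }
            })

    def parse(self, response):
        # response.body should now hold the rendered HTML instead of the bare shell
        self.log("Rendered page is %d bytes" % len(response.body))

Compared with driving Selenium, this keeps everything inside Scrapy's normal request/response flow, which is the overhead trade-off discussed below.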
On Thursday, April 2, 2015 at 3:48:06 PM UTC-5, Troy Perkins wrote:
>
> I've been Googling around all day on how to scrape a JavaScript page with
> Scrapy. I think that's the issue. From what I've found, Scrapy doesn't
> support parsing JavaScript, and using Selenium is the only workaround.
> That's too much overhead for what I'm wanting to do... oh well. Hoping to
> find another solution. Thanks for your help, Travis; it was greatly
> appreciated.
>
> On Thursday, April 2, 2015 at 3:03:37 PM UTC-5, Travis Leleu wrote:
>>
>> Your recent debug output doesn't have that error, so you must have fixed it.
>>
>> The current error feels like it's either a JavaScript-loaded page, or
>> you're getting blocked from scraping by the server.
>>
>> Google around for how to scrape a JavaScript page with Scrapy, and for
>> using a proxy. Those guides will be your friend.
>>
>> On Thu, Apr 2, 2015 at 12:58 PM, Troy Perkins <[email protected]> wrote:
>>
>>> Hi Travis, thanks for the response. Not sure why it's not able to find
>>> it; it's there, see below:
>>>
>>> pawnbahnimac:spiders pawnbahn$ pwd
>>> /Users/pawnbahn/tm/tm/spiders
>>> pawnbahnimac:spiders pawnbahn$ ls
>>> Books  Resources  __init__.py  __init__.pyc  items.json  tm_spider.py  tm_spider.pyc
>>> pawnbahnimac:spiders pawnbahn$
>>>
>>> It only behaves like this on this site for some reason. Running the dmoz
>>> example works fine.
>>>
>>> pawnbahnimac:spiders pawnbahn$ scrapy crawl tm
>>> :0: UserWarning: You do not have a working installation of the
>>> service_identity module: 'No module named service_identity'. Please
>>> install it from <https://pypi.python.org/pypi/service_identity> and make
>>> sure all of its dependencies are satisfied. Without the service_identity
>>> module and a recent enough pyOpenSSL to support it, Twisted can perform
>>> only rudimentary TLS client hostname verification. Many valid
>>> certificate/hostname mappings may be rejected.
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Optional features available: ssl, http11
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled item pipelines:
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Spider opened
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>>> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>>> 2015-04-02 14:56:01-0500 [tm] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Closing spider (finished)
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Dumping Scrapy stats:
>>> {'downloader/request_bytes': 260,
>>>  'downloader/request_count': 1,
>>>  'downloader/request_method_count/GET': 1,
>>>  'downloader/response_bytes': 6234,
>>>  'downloader/response_count': 1,
>>>  'downloader/response_status_count/200': 1,
>>>  'finish_reason': 'finished',
>>>  'finish_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 861714),
>>>  'log_count/DEBUG': 3,
>>>  'log_count/INFO': 7,
>>>  'response_received_count': 1,
>>>  'scheduler/dequeued': 1,
>>>  'scheduler/dequeued/memory': 1,
>>>  'scheduler/enqueued': 1,
>>>  'scheduler/enqueued/memory': 1,
>>>  'start_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 494696)}
>>> 2015-04-02 14:56:01-0500 [tm] INFO: Spider closed (finished)
>>>
>>> On Thursday, April 2, 2015 at 11:30:41 AM UTC-5, Travis Leleu wrote:
>>>>
>>>> Python can't find the file whose path is stored in filename, which is
>>>> used on line 13 of your spider. Read your Scrapy debug output to find
>>>> out more information.
>>>>
>>>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>>     with open(filename, 'wb') as f:
>>>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>>>
>>>> On Wed, Apr 1, 2015 at 10:38 PM, Troy Perkins <[email protected]> wrote:
>>>>
>>>>> Greetings all:
>>>>>
>>>>> I'm new to Scrapy and managed to get everything installed and working.
>>>>> However, my simple test project has proven not so simple, at least for me.
>>>>>
>>>>> I simply want to request the home page of t 1 c k e t m a s t e r
>>>>> d o t c o m, click the red Just Announced tab down the middle of the page,
>>>>> and -o the list of results out to an email address once a day via cron. I
>>>>> want to keep up with the announcements because their mailing lists simply
>>>>> don't send them soon enough.
>>>>>
>>>>> Here is my starting spider, which I've tested with other sites, and it
>>>>> works fine. I believe the error is due to it being a JavaScript-rendered
>>>>> site. I've used Firebug to look for clues, but I'm too new at this, and
>>>>> at JavaScript, to make sense of it. I'm hoping someone would be willing
>>>>> to point this noob in a direction. I've also tried removing middleware
>>>>> in the settings.py file, with the same results.
>>>>>
>>>>> I've purposely masked out the site address; even though I don't mean any
>>>>> harm, I'm not quite sure of their ToS as of yet. I plan to poll once a
>>>>> day anyway, for personal use.
>>>>>
>>>>> import scrapy
>>>>>
>>>>> from tm.items import TmItem
>>>>>
>>>>> class TmSpider(scrapy.Spider):
>>>>>     name = "tm"
>>>>>     allowed_domains = ["www.************.com"]
>>>>>     start_urls = [
>>>>>         "http://www.***********.com"
>>>>>     ]
>>>>>
>>>>>     def parse(self, response):
>>>>>         filename = response.url.split("/")[-2]
>>>>>         with open(filename, 'wb') as f:
>>>>>             f.write(response.body)
>>>>>
>>>>> scrapy crawl tm results in the following:
>>>>>
>>>>> :0: UserWarning: You do not have a working installation of the
>>>>> service_identity module: 'No module named service_identity'. Please
>>>>> install it from <https://pypi.python.org/pypi/service_identity> and make
>>>>> sure all of its dependencies are satisfied. Without the service_identity
>>>>> module and a recent enough pyOpenSSL to support it, Twisted can perform
>>>>> only rudimentary TLS client hostname verification. Many valid
>>>>> certificate/hostname mappings may be rejected.
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl, http11
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
>>>>> 2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
>>>>> 2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>>>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>>>>> 2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET http://www.****************com> (referer: None)
>>>>> 2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET http://www.****************.com>
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
>>>>>     self.runUntilCurrent()
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>>>>>     call.func(*call.args, **call.kw)
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
>>>>>     self._startRunCallbacks(result)
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
>>>>>     self._runCallbacks()
>>>>> --- <exception caught here> ---
>>>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
>>>>>     current.result = callback(current.result, *args, **kw)
>>>>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>>>     with open(filename, 'wb') as f:
>>>>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
>>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
>>>>> {'downloader/request_bytes': 219,
>>>>>  'downloader/request_count': 1,
>>>>>  'downloader/request_method_count/GET': 1,
>>>>>  'downloader/response_bytes': 73266,
>>>>>  'downloader/response_count': 1,
>>>>>  'downloader/response_status_count/200': 1,
>>>>>  'finish_reason': 'finished',
>>>>>  'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
>>>>>  'log_count/DEBUG': 3,
>>>>>  'log_count/ERROR': 1,
>>>>>  'log_count/INFO': 7,
>>>>>  'response_received_count': 1,
>>>>>  'scheduler/dequeued': 1,
>>>>>  'scheduler/dequeued/memory': 1,
>>>>>  'scheduler/enqueued': 1,
>>>>>  'scheduler/enqueued/memory': 1,
>>>>>  'spider_exceptions/IOError': 1,
>>>>>  'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
>>>>> 2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
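A closing note on the IOError in the traceback above, since it's separate from the JavaScript question: filename = response.url.split("/")[-2] comes out as an empty string for a bare homepage URL, because "http://www.example.com".split("/") gives ['http:', '', 'www.example.com'] and index -2 is the '' between the two slashes of "http://", so open('') fails. Below is a small sketch of one way to derive a non-empty filename instead; example.com stands in for the masked site and this hasn't been run against it.

import scrapy
from urlparse import urlparse  # Python 2 stdlib (Scrapy 0.24 era); urllib.parse on Python 3

class TmSpider(scrapy.Spider):
    name = "tm"
    allowed_domains = ["www.example.com"]      # stand-in for the masked domain
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # The hostname is never empty, so this avoids the open('') failure that
        # an empty path segment causes for a URL with no trailing path.
        filename = urlparse(response.url).netloc + ".html"
        with open(filename, 'wb') as f:
            f.write(response.body)

That only fixes the file-saving step; the Just Announced listings still won't appear in response.body until the page's JavaScript is rendered by something like the scrapyjs setup sketched at the top of the thread, or by Selenium.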
