Try a real USER_AGENT setting.

Rolando
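P.S. A quick sketch of what I mean, in ui5/settings.py (untested; the UA string below is only an example, any current real-browser string will do):

    # Pretend to be a desktop browser instead of the default
    # "Scrapy/1.1.0 (+http://scrapy.org)" that shows up in your Splash log.
    USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')

You can see in your own render.json log below that the headers Scrapy sends (including "User-Agent": "Scrapy/1.1.0 (+http://scrapy.org)") are forwarded to Splash, so the site may simply be serving different (or empty) iframe content to an obvious bot user agent, which would leave childFrames empty and cause exactly the IndexError you hit.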
On Fri, Jun 3, 2016 at 3:58 PM, David Fishburn <[email protected]> wrote:

> Rolando, thanks so much for the help so far.
>
> I gave up on Windows and decided to do this inside an Ubuntu 16.04 VM.
>
> This has been quite a learning experience, but I am stuck at one point where you obviously made it past.
>
> Here is what I have done so far.
>
> On my Linux box, I start Docker:
>
> sudo docker run -p 5023:5023 -p 8050:8050 -p :8051 scrapinghub/splash
>
> curl 'http://localhost:8050/render.html?url=https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html'
>
> 2016-06-03 18:45:30+0000 [-] Log opened.
> 2016-06-03 18:45:30.977689 [-] Splash version: 2.1
> 2016-06-03 18:45:30.978469 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17, Twisted 16.1.1, Lua 5.2
> 2016-06-03 18:45:30.978726 [-] Python 3.4.3 (default, Oct 14 2015, 20:28:29) [GCC 4.8.4]
> 2016-06-03 18:45:30.979088 [-] Open files limit: 1048576
> 2016-06-03 18:45:30.979314 [-] Can't bump open files limit
> 2016-06-03 18:45:31.082894 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0', '1024x768x24']
> 2016-06-03 18:45:31.156718 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
> 2016-06-03 18:45:31.249329 [-] verbosity=1
> 2016-06-03 18:45:31.249597 [-] slots=50
> 2016-06-03 18:45:31.254768 [-] argument_cache_max_entries=500
> 2016-06-03 18:45:31.255229 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
> 2016-06-03 18:45:31.257577 [-] Site starting on 8050
> 2016-06-03 18:45:31.257732 [-] Starting factory <twisted.web.server.Site object at 0x7fb3b5ab7e48>
> process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: Failed to open "/etc/machine-id": No such file or directory
> See the manual page for dbus-uuidgen to correct this issue.
> 2016-06-03 18:45:36.244542 [events] {"load": [0.29, 0.11, 0.07], "client_ip": "172.17.0.1", "path": "/render.html", "timestamp": 1464979536, "args": {"uid": 140409823866608, "url": "https://sapui5.hana.ondemand.com/sdk/"}, "active": 0, "_id": 140409823866608, "maxrss": 101772, "user-agent": "curl/7.47.0", "rendertime": 0.539576530456543, "qsize": 0, "method": "GET", "status_code": 200, "fds": 20}
> 2016-06-03 18:45:36.244885 [-] "172.17.0.1" - - [03/Jun/2016:18:45:35 +0000] "GET /render.html?url=https://sapui5.hana.ondemand.com/sdk/ HTTP/1.1" 200 10270 "-" "curl/7.47.0"
>
> All working perfectly.
>
> These are the steps I followed to create a brand-new Scrapy project and add Splash to the settings.py and the parse (as you did in your previous post):
>
> Ubuntu already has Python 2.7, but now we need to get Scrapy.
> pip is not installed as part of Ubuntu:
>
> sudo apt-get -y install python-pip
> sudo apt-get -y install libxml2-dev libxslt1-dev libffi-dev libssl-dev python-dev
> sudo pip install scrapy
>
> Create a new blank Scrapy project:
>
> root@ubuntu:/opt/scrapy# scrapy startproject ui5
> New Scrapy project 'ui5', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
>     /opt/scrapy/ui5
>
> You can start your first spider with:
>     cd ui5
>     scrapy genspider example example.com
>
> root@ubuntu:/opt/scrapy# cd ui5
> root@ubuntu:/opt/scrapy/ui5# scrapy genspider sapui5 sapui5.hana.ondemand.com
> Created spider 'sapui5' using template 'basic' in module:
>     ui5.spiders.sapui5
>
> Now run the Scrapy project with:
>
> root@ubuntu:/opt/scrapy/ui5# scrapy crawl sapui5
> -- You can see here the URL is incorrect.
> 2016-06-03 10:57:35 [scrapy] DEBUG: Retrying <GET http://www.sapui5.hana.ondemand.com/robots.txt> (failed 1
>
> root@ubuntu:/opt/scrapy/ui5# vim ui5/spiders/sapui5.py
> -- Update the URL to be correct
>
> class Sapui5Spider(scrapy.Spider):
>     name = "sapui5"
>     allowed_domains = ["sapui5.hana.ondemand.com"]
>     start_urls = (
>         'https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html/',
>     )
>
>     def parse(self, response):
>         pass
>
> root@ubuntu:/opt/scrapy/ui5# scrapy crawl sapui5
> -- Success, but nothing really done, as it uses iframes and other issues.
>
> 2016-06-03 11:01:25 [scrapy] INFO: Enabled item pipelines:
> []
> 2016-06-03 11:01:25 [scrapy] INFO: Spider opened
> 2016-06-03 11:01:25 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2016-06-03 11:01:25 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2016-06-03 11:01:25 [scrapy] DEBUG: Crawled (404) <GET https://sapui5.hana.ondemand.com/robots.txt> (referer: None)
> 2016-06-03 11:01:25 [scrapy] DEBUG: Crawled (200) <GET https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html/> (referer: None)
> 2016-06-03 11:01:25 [scrapy] INFO: Closing spider (finished)
> 2016-06-03 11:01:25 [scrapy] INFO: Dumping Scrapy stats:
> {'downloader/request_bytes': 458,
>  'downloader/request_count': 2,
>  'downloader/request_method_count/GET': 2,
>  'downloader/response_bytes': 4105,
>  'downloader/response_count': 2,
>  'downloader/response_status_count/200': 1,
>  'downloader/response_status_count/404': 1,
>  'finish_reason': 'finished',
>  'finish_time': datetime.datetime(2016, 6, 3, 18, 1, 25, 649281),
>  'log_count/DEBUG': 3,
>  'log_count/INFO': 7,
>  'response_received_count': 2,
>  'scheduler/dequeued': 1,
>  'scheduler/dequeued/memory': 1,
>  'scheduler/enqueued': 1,
>  'scheduler/enqueued/memory': 1,
>  'start_time': datetime.datetime(2016, 6, 3, 18, 1, 25, 48404)}
> 2016-06-03 11:01:25 [scrapy] INFO: Spider closed (finished)
>
> *Now update the spider / parser to use Splash and the Docker container*
>
> -- Edit the spider and update it
> root@ubuntu:/opt/scrapy/ui5# vim ui5/spiders/sapui5.py
>
> # -*- coding: utf-8 -*-
> import scrapy
> from scrapy_splash import SplashRequest
>
>
> class Sapui5Spider(scrapy.Spider):
>     name = "sapui5"
>     start_urls = ['https://sapui5.hana.ondemand.com/']
>
>     def parse(self, response):
>         url = 'https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html'
>         yield SplashRequest(url, self.parse_page,
>                             args={
>                                 'wait': 5.,
>                                 'iframes': True,
>                                 'html': True,
>                             },
>                             endpoint='render.json')
>
>     def parse_page(self, response):
>         iframe_html = response.data['childFrames'][0]['html']
>         sel = scrapy.Selector(text=iframe_html)
>         for div in sel.css('#content .sectionItem'):
>             name = div.css('a::text').extract_first()
>             desc = div.css('.description::text').extract_first() or ''
>             print(': '.join([name, desc]))
>
> #class Sapui5Spider(scrapy.Spider):
> #    name = "sapui5"
> #    allowed_domains = ["sapui5.hana.ondemand.com"]
> #    start_urls = (
> #        'https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html/',
> #    )
> #
> #    def parse(self, response):
> #        pass
>
> root@ubuntu:/opt/scrapy/ui5/ui5/spiders# scrapy runspider sapui5.py
> -- You get an error since it doesn't know how to get to the Docker Splash container
> NameError: global name 'SplashRequest' is not defined
>
> *Enable Scrapy to use Splash as Middleware*
>
> <https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/>
> pip install scrapy-splash
>
> root@ubuntu:/opt/scrapy/ui5/ui5# vim settings.py
>
> # Enable or disable downloader middlewares
> # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
> #DOWNLOADER_MIDDLEWARES = {
> #    'ui5.middlewares.MyCustomDownloaderMiddleware': 543,
> #}
> DOWNLOADER_MIDDLEWARES = {
>     'scrapy_splash.SplashCookiesMiddleware': 723,
>     'scrapy_splash.SplashMiddleware': 725,
> }
> SPLASH_URL = 'http://localhost:8050/'
> DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
> HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
>
> -- The middleware needs to take precedence over HttpProxyMiddleware, which by default is at position 750, so we set the middleware positions to numbers below 750.
>
> Now, here is the problem: I don't get the same output you listed:
>
> root@ubuntu:/opt/scrapy/ui5/ui5/spiders# scrapy runspider sapui5.py
>
> 2016-06-03 11:54:26 [scrapy] INFO: Scrapy 1.1.0 started (bot: ui5)
> 2016-06-03 11:54:26 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ui5.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['ui5.spiders'], 'BOT_NAME': 'ui5', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'}
> 2016-06-03 11:54:26 [scrapy] INFO: Enabled extensions:
> ['scrapy.extensions.logstats.LogStats',
>  'scrapy.extensions.telnet.TelnetConsole',
>  'scrapy.extensions.corestats.CoreStats']
> 2016-06-03 11:54:26 [scrapy] INFO: Enabled downloader middlewares:
> ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
>  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
>  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
>  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
>  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
>  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
>  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
>  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
>  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
>  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
>  'scrapy_splash.SplashCookiesMiddleware',
>  'scrapy_splash.SplashMiddleware',
>  'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
>  'scrapy.downloadermiddlewares.stats.DownloaderStats']
> 2016-06-03 11:54:26 [scrapy] INFO: Enabled spider middlewares:
> ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
>  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
>  'scrapy.spidermiddlewares.referer.RefererMiddleware',
>  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
>  'scrapy.spidermiddlewares.depth.DepthMiddleware']
> 2016-06-03 11:54:26 [scrapy] INFO: Enabled item pipelines:
> []
> 2016-06-03 11:54:26 [scrapy] INFO: Spider opened
> 2016-06-03 11:54:26 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2016-06-03 11:54:26 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2016-06-03 11:54:27 [scrapy] DEBUG: Crawled (404) <GET https://sapui5.hana.ondemand.com/robots.txt> (referer: None)
> 2016-06-03 11:54:27 [scrapy] DEBUG: Crawled (200) <GET https://sapui5.hana.ondemand.com/> (referer: None)
> 2016-06-03 11:54:27 [scrapy] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
> 2016-06-03 11:54:31 [scrapy] DEBUG: Crawled (200) <GET https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html via http://localhost:8050/render.json> (referer: None)
> 2016-06-03 11:54:32 [scrapy] ERROR: Spider error processing <GET https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html via http://localhost:8050/render.json> (referer: None)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "/opt/scrapy/ui5/ui5/spiders/sapui5.py", line 20, in parse_page
>     iframe_html = response.data['childFrames'][0]['html']
> IndexError: list index out of range
> 2016-06-03 11:54:32 [scrapy] INFO: Closing spider (finished)
> 2016-06-03 11:54:32 [scrapy] INFO: Dumping Scrapy stats:
> {'downloader/request_bytes': 1365,
>  'downloader/request_count': 4,
>  'downloader/request_method_count/GET': 3,
>  'downloader/request_method_count/POST': 1,
>  'downloader/response_bytes': 18165,
>  'downloader/response_count': 4,
>  'downloader/response_status_count/200': 2,
>  'downloader/response_status_count/404': 2,
>  'finish_reason': 'finished',
>  'finish_time': datetime.datetime(2016, 6, 3, 18, 54, 32, 163122),
>  'log_count/DEBUG': 5,
>  'log_count/ERROR': 1,
>  'log_count/INFO': 7,
>  'request_depth_max': 1,
>  'response_received_count': 4,
>  'scheduler/dequeued': 3,
>  'scheduler/dequeued/memory': 3,
>  'scheduler/enqueued': 3,
>  'scheduler/enqueued/memory': 3,
>  'spider_exceptions/IndexError': 1,
>  'splash/render.json/request_count': 1,
>  'splash/render.json/response_count/200': 1,
>  'start_time': datetime.datetime(2016, 6, 3, 18, 54, 26, 822252)}
> 2016-06-03 11:54:32 [scrapy] INFO: Spider closed (finished)
> root@ubuntu:/opt/scrapy/ui5/ui5/spiders#
>
> And the corresponding Docker / Splash output (taken at different times to create this post):
>
> 2016-06-03 18:42:27+0000 [-] Log opened.
> 2016-06-03 18:42:27.165439 [-] Splash version: 2.1
> 2016-06-03 18:42:27.165891 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17, Twisted 16.1.1, Lua 5.2
> 2016-06-03 18:42:27.166142 [-] Python 3.4.3 (default, Oct 14 2015, 20:28:29) [GCC 4.8.4]
> 2016-06-03 18:42:27.166252 [-] Open files limit: 1048576
> 2016-06-03 18:42:27.166431 [-] Can't bump open files limit
> 2016-06-03 18:42:27.270648 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0', '1024x768x24']
> 2016-06-03 18:42:27.359173 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
> 2016-06-03 18:42:27.446969 [-] verbosity=1
> 2016-06-03 18:42:27.447204 [-] slots=50
> 2016-06-03 18:42:27.447408 [-] argument_cache_max_entries=500
> 2016-06-03 18:42:27.447715 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
> 2016-06-03 18:42:27.449404 [-] Site starting on 8050
> 2016-06-03 18:42:27.449533 [-] Starting factory <twisted.web.server.Site object at 0x7f5c8ed4ce48>
> 2016-06-03 18:42:33.741040 [-] "172.17.0.1" - - [03/Jun/2016:18:42:33 +0000] "GET /robots.txt HTTP/1.1" 404 153 "-" "Scrapy/1.1.0 (+http://scrapy.org)"
> process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: Failed to open "/etc/machine-id": No such file or directory
> See the manual page for dbus-uuidgen to correct this issue.
> 2016-06-03 18:42:38.960163 [events] {"timestamp": 1464979358, "rendertime": 5.214949131011963, "_id": 140035510107888, "fds": 19, "active": 0, "client_ip": "172.17.0.1", "maxrss": 81264, "qsize": 0, "user-agent": "Scrapy/1.1.0 (+http://scrapy.org)", "load": [0.01, 0.04, 0.05], "status_code": 200, "path": "/render.json", "args": {"iframes": true, "url": "https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html", "html": true, "headers": {"User-Agent": "Scrapy/1.1.0 (+http://scrapy.org)", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Referer": "https://sapui5.hana.ondemand.com/", "Accept-Language": "en", "Accept-Encoding": "gzip,deflate"}, "wait": 5.0, "uid": 140035510107888}, "method": "POST"}
> 2016-06-03 18:42:38.960617 [-] "172.17.0.1" - - [03/Jun/2016:18:42:38 +0000] "POST /render.json HTTP/1.1" 200 13662 "-" "Scrapy/1.1.0 (+http://scrapy.org)"
>
> Do you see what step I missed, where "curl" works but the spider does not?
>
> Thanks,
> David
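About the IndexError in the quoted traceback: while you chase down the user-agent question, you can also make parse_page tolerant of an empty render result instead of letting the spider crash. A rough sketch that just reuses your own selectors (untested against that site):

    def parse_page(self, response):
        # response.data comes from Splash's render.json endpoint; childFrames
        # is only populated when iframes were actually rendered.
        frames = response.data.get('childFrames') or []
        if not frames:
            self.logger.warning('No child frames rendered for %s', response.url)
            return
        sel = scrapy.Selector(text=frames[0]['html'])
        for div in sel.css('#content .sectionItem'):
            name = div.css('a::text').extract_first()
            desc = div.css('.description::text').extract_first() or ''
            print(': '.join([name, desc]))

That way the crawl still finishes cleanly and you get a log line telling you the iframe was not rendered, which is easier to correlate with the Splash log than a traceback.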
