> Try a real USER_AGENT setting.

I added the following to settings.py, which I pulled from Fiddler on my Windows desktop:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'



Same result. Here is the Splash log:




2016-06-03 20:37:19.777710 [-] Splash version: 2.1
2016-06-03 20:37:19.781558 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17, 
Twisted 16.1.1, Lua 5.2
2016-06-03 20:37:19.781864 [-] Python 3.4.3 (default, Oct 14 2015, 20:28:29) 
[GCC 4.8.4]
2016-06-03 20:37:19.782499 [-] Open files limit: 1048576
2016-06-03 20:37:19.782676 [-] Can't bump open files limit
2016-06-03 20:37:19.903300 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0', 
'1024x768x24']
2016-06-03 20:37:20.115657 [-] proxy profiles support is enabled, proxy 
profiles path: /etc/splash/proxy-profiles
2016-06-03 20:37:20.319444 [-] verbosity=1
2016-06-03 20:37:20.319719 [-] slots=50
2016-06-03 20:37:20.320095 [-] argument_cache_max_entries=500
2016-06-03 20:37:20.320618 [-] Web UI: enabled, Lua: enabled (sandbox: 
enabled)
2016-06-03 20:37:20.323905 [-] Site starting on 8050
2016-06-03 20:37:20.324129 [-] Starting factory <twisted.web.server.Site 
object at 0x7f279ce6fe48>
2016-06-03 20:37:24.992726 [-] "172.17.0.1" - - [03/Jun/2016:20:37:24 
+0000] "GET /robots.txt HTTP/1.1" 404 153 "-" "Mozilla/5.0 (Windows NT 
10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) 
Chrome/50.0.2661.102 Safari/537.36"
process 1: D-Bus library appears to be incorrectly set up; failed to read 
machine uuid: Failed to open "/etc/machine-id": No such file or directory
See the manual page for dbus-uuidgen to correct this issue.
2016-06-03 20:37:35.964753 [events] {"timestamp": 1464986255, "_id": 
139808112914160, "active": 0, "args": {"iframes": true, "headers": 
{"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36", 
"Referer": "https://sapui5.hana.ondemand.com/", "Accept-Language": "en", 
"Accept": 
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
"Accept-Encoding": "gzip,deflate"}, "uid": 139808112914160, "wait": 10.0, 
"html": true, "url": 
"https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html"}, 
"maxrss": 79608, "fds": 19, "client_ip": "172.17.0.1", "path": 
"/render.json", "status_code": 200, "rendertime": 10.95194411277771, 
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 
(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36", "method": "POST", 
"qsize": 0, "load": [0.09, 0.08, 0.06]}
2016-06-03 20:37:35.967636 [-] "172.17.0.1" - - [03/Jun/2016:20:37:35 
+0000] "POST /render.json HTTP/1.1" 200 13662 "-" "Mozilla/5.0 (Windows NT 
10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) 
Chrome/50.0.2661.102 Safari/537.36"
 

Rather than letting parse_page fail with the index-out-of-bounds error, I changed 
it to:

    def parse_page(self, response):
        if 'response' in locals():
            print('response is defined')
        else:
            print('Ooops: response is not defined')
            return
        # Dump the response object itself
        print(response)
        if 'response.data' in locals():
            print('response.data is defined')
        else:
            print('Ooops: response.data is not defined')
            return
        # Dump the decoded render.json payload and its size
        print(response.data)
        print('Len response.data: %d' % len(response.data))
        if 'childFrames' in response.data:
            print('There is a childFrame')
        else:
            print('Ooops: no childFrames')
            return
        if len(response.data['childFrames']) > 0:
            print('There is childFrame 0')
        else:
            print('Ooops: no childFrame 0')
            return
        print('Len first child: %d' % len(response.data['childFrames'][0]))
        print('Len html: %d' % len(response.data['childFrames'][0]['html']))
        iframe_html = response.data['childFrames'][0]['html']


With that, I get "response.data is not defined":

2016-06-03 13:37:24 [scrapy] DEBUG: Crawled (200) <GET https://sapui5.hana.ondemand.com/> (referer: None)
2016-06-03 13:37:24 [scrapy] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
2016-06-03 13:37:36 [scrapy] DEBUG: Crawled (200) <GET https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html via http://localhost:8050/render.json> (referer: None)
response is defined
<200 https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html>
Ooops: response.data is not defined
2016-06-03 13:37:36 [scrapy] INFO: Closing spider (finished)



I'm not really sure what that indicates.

Does it point to a Splash problem? HTML should have been returned, just as it 
was from curl.
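
One possible reading of the "response.data is not defined" line: locals() maps local variable names to their values, so the string 'response.data' can never be one of its keys, whatever the response actually contains. The message may therefore say nothing about Splash. Below is a minimal sketch of the difference. FakeResponse is a made-up stand-in, not a Scrapy or scrapy-splash class, and it assumes the decoded JSON is exposed as a .data attribute, as parse_page expects.

```python
class FakeResponse(object):
    """Hypothetical stand-in for a response carrying decoded JSON in .data."""
    def __init__(self, data):
        self.data = data


class PlainResponse(object):
    """Hypothetical stand-in for a response without a .data attribute."""


def check_with_locals(response):
    # Same test as in parse_page: looks for a local *variable* whose name
    # is the literal string 'response.data'. No such variable can exist.
    return 'response.data' in locals()


def first_iframe_html(response):
    # Attribute lookup instead of a name lookup in locals().
    data = getattr(response, 'data', None)
    if data is None:
        return None
    frames = data.get('childFrames') or []
    if not frames:
        return None
    return frames[0].get('html')
```

With a response whose .data holds {'childFrames': [{'html': '<p>hi</p>'}]}, check_with_locals still returns False while first_iframe_html returns the HTML string.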

I'm just trying to figure out which of the moving parts to drill into.
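
One way to separate the Splash side from the Scrapy side might be to make the same render.json call without Scrapy at all. The sketch below assumes Splash is listening on localhost:8050, as in the log, and reuses the html, iframes and wait arguments from the logged request. The function names are made up for illustration.

```python
import json

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

# Assumed Splash endpoint, taken from the log output.
SPLASH_URL = 'http://localhost:8050/render.json'


def build_render_request(page_url, wait=10.0):
    # JSON body with the same arguments the logged request carried.
    payload = {'url': page_url, 'html': 1, 'iframes': 1, 'wait': wait}
    body = json.dumps(payload).encode('utf-8')
    return Request(SPLASH_URL, data=body,
                   headers={'Content-Type': 'application/json'})


def fetch_render_json(page_url):
    # Needs a running Splash instance; returns the decoded JSON as a dict.
    response = urlopen(build_render_request(page_url))
    return json.loads(response.read().decode('utf-8'))
```

Calling fetch_render_json('https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html') and looking at the keys of the result would show whether 'childFrames' comes back from Splash itself, independent of the spider.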

Thanks,
David


