Hi David,

I just implemented the approach you mentioned, as I have a similar requirement (I also need to scrape dynamic data), using the driver's execute_script() method to do some processing. Scrapy is set up to keep crawling the extracted links until the CLOSESPIDER_TIMEOUT limit is reached. This works just fine, but sometimes the driver stops responding (the MW process gets stuck at random). I'm not sure whether the driver itself has issues or whether Scrapy becomes unable to handle the requests/responses being passed through the MW.
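To keep a hung render from blocking the whole crawl, I'm considering putting a hard timeout around the driver calls. A minimal sketch of what I mean (the 30-second values and the fall-back-to-a-normal-download behaviour are assumptions, not something I have tested):

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

class PhantomJsMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.set_page_load_timeout(30)  # fail fast instead of hanging (assumed value)
        driver.set_script_timeout(30)     # same guard for execute_script() calls
        try:
            driver.get(request.url)
            return HtmlResponse(driver.current_url, body=driver.page_source,
                                encoding='utf-8', request=request)
        except TimeoutException:
            return None  # give up on the render; Scrapy downloads the request normally
        finally:
            driver.quit()  # never leave a wedged PhantomJS process behind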
Here are the spider settings:

DOWNLOADER_MIDDLEWARES = {
    'keywords.phantomMiddleware.PhantomJsMiddleware': 99,
}

# AutoThrottle settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_DEBUG = True
AUTOTHROTTLE_MAX_DELAY = 60

CLOSESPIDER_TIMEOUT = 21600

Any thoughts?

On Thursday, 14 May 2015 19:13:09 UTC+5, Joey Espinosa wrote:
>
> David,
>
> I've written middleware to intercept a JS-specific request before it is
> processed. I haven't used WaitFor.js, so I can't help you there, but I can
> help get you started with PhantomJS.
>
> class JSMiddleware(BaseMiddleware):
>     def process_request(self, request, spider):
>         if request.meta.get('js'):  # you probably want a conditional trigger
>             driver = webdriver.PhantomJS()
>             driver.get(request.url)
>             body = driver.page_source
>             return HtmlResponse(driver.current_url, body=body,
>                                 encoding='utf-8', request=request)
>         return
>
> That's the simplest approach. You may end up wanting to add options to the
> webdriver.PhantomJS() call, such as desired_capabilities including SSL
> handling options or a user agent string. You may also want to wrap the
> driver.get() call in a try/except block. Additionally, you should do
> something with the cookies that come back from PhantomJS via
> driver.get_cookies().
>
> Also, if you want every request to go through JS, then you can remove the
> request.meta['js'] conditional. Otherwise, you could insert that
> information for initial requests in a spider.make_requests_from_url
> override, or you could simply have a spider instance method like
> spider.run_js(request) where the spider looks at the request and
> determines whether it needs JS based on some criteria you come up with.
>
> There are a lot of options for you with PhantomJS, so it's really up to
> you, but this should be a decent starting point. I hope this answers your
> question.
>
> --
> Respectfully,
>
> Joey Espinosa
> http://about.me/joelinux
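For anyone following along, here is roughly how those additions could look folded into Joey's middleware. This is an untested sketch: the user-agent string, the SSL flag, and the cookie handling via request.meta are my assumptions, not part of Joey's original code.

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

class JSMiddleware(object):
    def process_request(self, request, spider):
        if not request.meta.get('js'):
            return  # not flagged for JS; let Scrapy download it normally
        caps = dict(DesiredCapabilities.PHANTOMJS)
        caps['phantomjs.page.settings.userAgent'] = 'Mozilla/5.0 (placeholder)'
        driver = webdriver.PhantomJS(
            desired_capabilities=caps,
            service_args=['--ignore-ssl-errors=true'])  # one way to relax SSL checks
        try:
            driver.get(request.url)
            # expose the PhantomJS cookies to the spider via response.meta
            request.meta['phantomjs_cookies'] = driver.get_cookies()
            return HtmlResponse(driver.current_url, body=driver.page_source,
                                encoding='utf-8', request=request)
        except WebDriverException:
            return None  # fall back to the plain (non-JS) download
        finally:
            driver.quit()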
> On Thu, May 14, 2015 at 9:57 AM David Fishburn <dfishb...@gmail.com> wrote:
>
>> Thanks for the response, José.
>>
>> That integrates Splash as the JS renderer. From the documentation I have
>> read, it looks like Splash does not support Windows.
>>
>> David
>>
>> On Thursday, May 14, 2015 at 12:24:08 AM UTC-4, José Ricardo wrote:
>>
>>> Hi David, have you given ScrapyJS a try?
>>>
>>> https://github.com/scrapinghub/scrapyjs
>>>
>>> Besides rendering the page, it can also take screenshots :)
>>>
>>> Regards,
>>>
>>> José
>>>
>>> On Wed, May 13, 2015 at 3:54 PM, Travis Leleu <m...@travisleleu.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Honestly, I have yet to find a good integration between Scrapy and a JS
>>>> browser. The current methods all seem to download the basic page via
>>>> urllib3, then send that HTML off to render and fetch the other resources.
>>>>
>>>> This causes a bottleneck: the browser process, usually exposed via an
>>>> API, takes a lot of CPU and time to render the page. It also doesn't
>>>> easily use proxies, which means all subsequent requests will come from
>>>> one IP address.
>>>>
>>>> I think it would be a lot of work to build this into Scrapy.
>>>>
>>>> In my own work, I tend to write my own (scaled-down) scraping engine
>>>> that works more directly with a headless JS browser.
>>>>
>>>> On Wed, May 13, 2015 at 12:32 PM, David Fishburn <dfishb...@gmail.com> wrote:
>>>>
>>>>> I am new to Scrapy and Python.
>>>>>
>>>>> I have a site I need to scrape, but it is all AJAX driven, so I will
>>>>> need something like PhantomJS to yield the final page rendering.
>>>>>
>>>>> I have been searching in vain for a simple example of a downloader
>>>>> middleware which uses PhantomJS. It has been around long enough that
>>>>> I am sure someone has already written one. I can find complete
>>>>> projects for Splash and others, but I am on Windows.
>>>>>
>>>>> It doesn't need to be fancy, just take the Scrapy request and return
>>>>> the PhantomJS page (most likely using WaitFor.js, which the PhantomJS
>>>>> dev team wrote, to only return the page after it has stopped making
>>>>> AJAX calls).
>>>>>
>>>>> I am completely lost trying to get started. The documentation
>>>>> (http://doc.scrapy.org/en/latest/topics/downloader-middleware.html)
>>>>> covers the APIs, but it doesn't give a basic application which I
>>>>> could begin modifying to plug in the PhantomJS calls I have shown
>>>>> below (which are very simple).
>>>>>
>>>>> Does anyone have something I can use?
>>>>>
>>>>> This code does what I want when using the Scrapy shell:
>>>>>
>>>>> D:\Python27\Scripts\scrapy.exe shell https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html
>>>>>
>>>>> >>> from selenium import webdriver
>>>>> >>> driver = webdriver.PhantomJS()
>>>>> >>> driver.set_window_size(1024, 768)
>>>>> >>> driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
>>>>> -- Wait here for 30 seconds and let the AJAX calls finish
>>>>> >>> driver.save_screenshot('screen.png')
>>>>> >>> print driver.page_source
>>>>> >>> driver.quit()
>>>>>
>>>>> The screenshot contains a properly rendered browser page.
>>>>>
>>>>> Thanks for any advice you can give.
>>>>> David
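P.S. On the fixed 30-second wait in David's shell session: Selenium's WebDriverWait can return as soon as the rendered content actually appears, which is roughly what WaitFor.js does on the PhantomJS side. A minimal sketch — the '#content' selector is a made-up placeholder; you would poll for whatever only exists once the AJAX calls have finished:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)
driver.get('https://sapui5.netweaver.ondemand.com/sdk/#docs/api/symbols/sap.html')
try:
    # Poll (up to 30s) until an element produced by the AJAX render shows up.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#content')))
    print driver.page_source
finally:
    driver.quit()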