You may need to use HtmlUnit for this:

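import com.gargoylesoftware.htmlunit.UrlFetchWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Set up the headless browser; on GAE its HTTP traffic has to go through UrlFetch
WebClient webClient = new WebClient();
webClient.setWebConnection(new UrlFetchWebConnection(webClient));

// Load the page, then give its JavaScript time to run.
// GAE hack: GAE is single-threaded, so the event loop is pumped by hand
// for PUMP_TIME milliseconds.
HtmlPage page = webClient.getPage(target);
webClient.getJavaScriptEngine().pumpEventLoop(PUMP_TIME);
pageString = page.asXml();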


You can check my implementation and POM configuration here:  
http://bit.ly/10eRef0

I've tested it on GAE as well as on JBoss CapeDwarf. The only problem I 
had was when fetching a big site, one with lots of resources, where I got 
a timeout exception. As you know, GAE limits how long front-end request 
code is allowed to run.
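
For the timeout, one thing that might help is pumping in short slices and 
stopping as soon as the content you need has shown up, instead of always 
waiting the full PUMP_TIME. Below is a rough sketch in plain Java; 
BoundedPump, pumpUntil and the budget/slice parameters are names I made up 
for illustration, they are not part of HtmlUnit or GAE:

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical helper (not an HtmlUnit or GAE class): runs work in short
// slices until a condition holds or a time budget is used up.
public final class BoundedPump {

    private BoundedPump() {}

    /**
     * Calls pumpSlice with a slice length in milliseconds until isDone
     * returns true or budgetMillis has elapsed.
     *
     * @return true if isDone became true, false if the budget ran out first
     */
    public static boolean pumpUntil(LongConsumer pumpSlice,
                                    BooleanSupplier isDone,
                                    long budgetMillis,
                                    long sliceMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        while (!isDone.getAsBoolean()) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false; // out of budget, give up with what we have
            }
            // Never pump past the deadline, even on the last slice.
            pumpSlice.accept(Math.min(sliceMillis, remainingMillis));
        }
        return true;
    }
}
```

With the snippet above, the slice callback would be something like 
ms -> webClient.getJavaScriptEngine().pumpEventLoop(ms), and the condition 
a check that the element you are after exists on the page. The budget has 
to stay well under the GAE request deadline, and this does not make a slow 
site any faster; it only avoids waiting longer than necessary.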

Cheers.

On Sunday, May 5, 2013 12:46:13 AM UTC+8, Phil wrote:
>
> I'm trying to grab the html from a web page. I think the standard GAE 
> way to do this is to use UrlFetch. I'm running into an issue that the page 
> I'm grabbing loads much of its content asynchronously. Is there any way to 
> have UrlFetch grab the html that loads via javascript?
>
> Specifically, I'm trying to grab this page: 
> http://www.groupon.com/browse/san-francisco?category=restaurants-and-bars
>
> Any idea how to get the html that loads via js in the middle of the page?
>
> Thanks,
> Phil
>
