Re: Can't get Google Crawler to index my GWT-based site no matter what I do. Help!

Benjamin Possolo Sun, 28 Oct 2012 22:26:40 -0700

On Sunday, October 28, 2012 8:50:27 AM UTC-7, Joseph Lust wrote:

> I see you're using places and the url tokens, are you using GWTP? There is 
> some built in crawler support there that you could use, or at least 
> investigate if you're not using GWTP.
>
> http://code.google.com/p/gwt-platform/wiki/CrawlerSupport
>
> Sincerely,
> Joseph
>


Thanks for responding!

Yes, you are correct. I am using places and activities (but not GWTP). I use 
straight GWT throughout the entire app (including all the editor + 
validation stuff).

GWTP does include a canned filter for handling crawlability. I have a very 
similar one, albeit slightly optimized.

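import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLDecoder;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// MemcacheUtil (used below) is an application helper class that is not shown here.

/**
 * Special filter that adds support for Google crawling as outlined here:
 * {@link https://developers.google.com/webmasters/ajax-crawling/docs/getting-started}
 *
 * @author Benjamin Possolo
 */
public class GoogleCrawlerFilter implements Filter {

    private static final Logger log = Logger.getLogger(GoogleCrawlerFilter.class.getName());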

    private static final ThreadLocal<WebClient> webClient = new ThreadLocal<WebClient>(){
        @Override
        protected WebClient initialValue() {
            log.info("Instantiating headless browser");
            WebClient wc = new WebClient(BrowserVersion.FIREFOX_3_6);
            wc.setThrowExceptionOnScriptError(false);
            wc.setThrowExceptionOnFailingStatusCode(false);
            wc.setCssEnabled(false);
            return wc;
        }
    };
    @Override
    public void init(FilterConfig config) throws ServletException {}

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest req = (HttpServletRequest)request;
        HttpServletResponse resp = (HttpServletResponse)response;
        String queryString = req.getQueryString();

        if( queryString != null && queryString.contains("_escaped_fragment_") ){

            log.info("Detected request from Google Crawler");

            //google requests the URL with the place fragment as a query parameter.
            //they do this because URL fragments (the portion after the hash #) are
            //not sent with an HTTP request.
            //convert the ugly URL to the real url that uses the hashbang
            queryString = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
            queryString = URLDecoder.decode(queryString, "UTF-8");

            StringBuilder pageToCrawlSb = new StringBuilder(req.getScheme())
                    .append("://").append(req.getServerName());
            if( req.getServerPort() > 0 )
                pageToCrawlSb.append(':').append(req.getServerPort());
            pageToCrawlSb.append(req.getRequestURI());
            if( ! queryString.startsWith("#!") )
                pageToCrawlSb.append('?');
            pageToCrawlSb.append(queryString);

            String pageToCrawl = pageToCrawlSb.toString();
            log.log(Level.INFO, "Page being crawled: {0}", pageToCrawl);

            //check if a snapshot of the requested page already exists
            String htmlSnapshot = MemcacheUtil.getHtmlSnapshot(pageToCrawl);
            if( htmlSnapshot == null ){
                try{
                    //use HtmlUnit to render the requested page
                    long start = System.currentTimeMillis();
                    log.info("Using headless browser to fetch page");
                    HtmlPage page = webClient.get().getPage(pageToCrawl);
                    log.info("Pumping javascript event loop for 8 seconds");
                    //execute javascript for 8 seconds
                    webClient.get().getJavaScriptEngine().pumpEventLoop(8000);
                    long end = System.currentTimeMillis();
                    log.log(Level.INFO, "Time to generate page snapshot: {0} seconds",
                            ((end - start) / 1000L));

                    //we add a special message to the top of the page so that anyone
                    //seeing the snapshot will know it is meant for Google crawling
                    String snapshotMsg = new StringBuilder("<body>\n\n")
                            .append("<hr />\n")
                            .append("<center>\n")
                            .append("  <h3>\n")
                            .append("    You are viewing a non-interactive page that is intended for the crawler.<br/>\n")
                            .append("    You probably want to see this page: <a href=\"" + pageToCrawl
                                    + "\">" + pageToCrawl + "</a>\n")
                            .append("  </h3>\n")
                            .append("</center>\n")
                            .append("<hr />\n")
                            .toString();
                    htmlSnapshot = page.asXml();
                    //quoteReplacement keeps a '$' or '\' in the URL from being
                    //interpreted as a regex group reference in the replacement
                    htmlSnapshot = htmlSnapshot.replaceFirst("<body[^>]*>",
                            java.util.regex.Matcher.quoteReplacement(snapshotMsg));

                    //store the rendered page in memcache
                    MemcacheUtil.putHtmlSnapshot(pageToCrawl, htmlSnapshot);
                }
                finally{
                    webClient.get().closeAllWindows();
                }
            }

            //send the html snapshot back to the crawler
            resp.setContentType("text/html; charset=UTF-8");
            PrintWriter writer = resp.getWriter();
            writer.print(htmlSnapshot);
        }
        else{
            chain.doFilter(request, response);
        }
    }

    @Override
    public void destroy() {
        //this is never called on Google App Engine
    }
}
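
To make the URL rewriting step easier to follow (and to test outside a servlet container), here is a rough standalone sketch of just that part of the filter. The class name EscapedFragmentUrls, the method toHashBangUrl, the example.com hosts, and the place tokens are all made up for illustration; only the replaceFirst + URLDecoder logic mirrors the filter above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: not part of the filter, just the same URL logic in isolation.
public class EscapedFragmentUrls {

    /** Rebuilds the hashbang URL from the parts of a crawler request. */
    public static String toHashBangUrl(String scheme, String host, int port,
                                       String requestUri, String queryString) {
        // Same two steps as the filter: swap the parameter name for "#!", then decode.
        String rest = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
        try {
            rest = URLDecoder.decode(rest, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }

        StringBuilder url = new StringBuilder(scheme).append("://").append(host);
        if (port > 0) {
            url.append(':').append(port);
        }
        url.append(requestUri);
        // Any other query parameters still need their leading '?'.
        if (!rest.startsWith("#!")) {
            url.append('?');
        }
        return url.append(rest).toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com:80/#!product:id=42
        System.out.println(toHashBangUrl("http", "www.example.com", 80, "/",
                "_escaped_fragment_=product%3Aid%3D42"));
        // prints http://www.example.com:8080/app.html?locale=en#!home
        System.out.println(toHashBangUrl("http", "www.example.com", 8080, "/app.html",
                "locale=en&_escaped_fragment_=home"));
    }
}
```

Note that the port is always appended, even when it is the default for the scheme, which is what the filter does as well. That is harmless for fetching the page but it does become part of the memcache key.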

