I might've been a bit too hasty with my conclusion. I was under the impression that the preview processor only generated PDFs of a page but it seems to do a whole lot more. The only contribution that I could see PhantomJS making (besides Google Indexing) is taking a screenshot of an external page.
I've deployed a patch to qa21-us and have been testing what the page looks like when the Google Bot (not the Google spider/crawler!) [1] fetches the page, and the first results are promising.
I've submitted qa21-us for a full crawl and will report back later on how well it handles everything.

Simon

[1] https://support.google.com/webmasters/bin/answer.py?hl=en&answer=158587

On 2 Aug 2012, at 23:20, Nate Angell <[email protected]> wrote:

> Thanks Simon!
>
> Given the possible connections to preview processing, including generating
> previews of sakai docs, we should make sure this conversation is hooked in to
> the one on the port of the preview processor to java...there might be some
> overlap.
>
> = nate
>
> On Thu, Aug 2, 2012 at 3:05 PM, Simon Gaeremynck <[email protected]>
> wrote:
> Hey,
>
> @Nate
> From doing a quick google search it looks like Bing and Yahoo support the
> Google "standard" regarding hashbangs [1].
> Afaict that's anecdotal evidence though, so I'll need to have a look whether
> that's actually the case.
>
> As Nico mentioned, Google's Cache should probably be fine as the Filter
> doesn't change the HTML in any way.
>
> @Christian
> From the tools I've tried it seems to be one of the better ones. They are
> looking into improving the PDF generator which could possibly be an option to
> replace the preview processor.
> I was under the impression that BSD would be compatible with the Sakai
> License but it might actually not be. As IANAL somebody with a deeper
> understanding of licensing should probably check.
>
> Regards,
>
> Simon
>
> [1]
> http://searchengineland.com/bing-now-supports-googles-crawlable-ajax-standard-84149
>
> On 2 Aug 2012, at 21:50, Nicolaas Matthijs
> <[email protected]> wrote:
>
>> Hi Nate,
>>
>>> Would the solution you propose serve other search indexers as well, or just
>>> Google?
>>
>> This is a good question and hopefully Simon can jump in here. I believe the
>> hashbang approach will only allow us to be indexed by Google, so perhaps
>> it's worth exploring whether or not using the User Agents and detecting
>> anything that's not a proper browser makes sense.
>>
>>> I was also wondering what the effect of the solution you propose would be
>>> on the cached versions of pages Google stores that are available via Google
>>> search, eg:
>>> http://webcache.googleusercontent.com/search?q=cache:ELwAKCibsysJ:www.sakaiproject.org/+&cd=1&hl=en&ct=clnk&gl=us
>>
>> I think this should be fine. We will return the HTML exactly like it would
>> be rendered in a browser, so that will be shown as the cached version as
>> well.
>>
>> Hope that helps,
>> Nicolaas
>>
>>> On Thu, Aug 2, 2012 at 11:03 AM, Simon Gaeremynck <[email protected]>
>>> wrote:
>>> Hi all,
>>>
>>> I've been working on KERN-3084 [1] which tries to add support for Google's
>>> AJAX crawler [2].
>>> When Google notices you're using AJAX/Javascript to display content on your
>>> page it sends a request to the server asking for a completely rendered
>>> page. The idea is that we then run the page through a headless browser and
>>> send that response back to Google.
>>>
>>> I've created an implementation [3] [4] that does this but I'd like some
>>> feedback before I send a PR.
>>> This commit would, much like the preview processor, bring in yet another
>>> dependency. I'm using PhantomJS [5] as it fires up a headless WebKit browser
>>> and exposes a nice little nodejs api that you can (ab)use.
>>> I tried using the same toolset as the previewprocessor (wkhtmltopdf) but
>>> that just seems to generate PDFs and doesn't allow access to the generated
>>> DOM?
>>> (PhantomJS supports PDF creation but it's nowhere near as good as
>>> wkhtmltopdf though.)
>>>
>>> What's the feeling about this? Does anyone have a recommendation for a
>>> better tool/approach?
>>>
>>> Regards,
>>>
>>> Simon
>>>
>>> [1] https://jira.sakaiproject.org/browse/KERN-3084
>>> [2]
>>> https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
>>> [3]
>>> https://github.com/simong/nakamura/commit/83212d6fe814ee32be7dd3d9cd771c40dff6f69f
>>> [4]
>>> https://confluence.sakaiproject.org/display/KERNDOC/KERN-3084+Making+OAE+indexable+by+Google
>>> [5] http://phantomjs.org/
>>>
>>> _______________________________________________
>>> oae-dev mailing list
>>> [email protected]
>>> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
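
For anyone following the thread who hasn't looked at the AJAX crawling scheme: under that scheme the crawler requests a hashbang URL such as `/me#!l=library` as `/me?_escaped_fragment_=l%3Dlibrary`, and the server is expected to answer with the fully rendered HTML for the hashbang form. Below is a minimal Python sketch of just the URL mapping step. It is not the KERN-3084 filter; the function name `to_hashbang_url` and the example.org URLs are made up for illustration, and the rendering step (handing the resulting URL to a headless browser) is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameter the crawler uses in place of the "#!" fragment.
ESCAPED_FRAGMENT = "_escaped_fragment_"


def to_hashbang_url(crawler_url):
    """Map a crawler URL (?_escaped_fragment_=...) back to its #! form.

    Returns None when the URL has no _escaped_fragment_ parameter, i.e.
    when the request should be served as a normal browser request.
    (Hypothetical helper, not part of any existing code base.)
    """
    parts = urlsplit(crawler_url)
    fragment = None
    remaining = []
    # parse_qsl percent-decodes values, which undoes the crawler's escaping.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == ESCAPED_FRAGMENT and fragment is None:
            fragment = value
        else:
            remaining.append((key, value))
    if fragment is None:
        return None
    # Rebuild the URL with the other query parameters and a "#!" fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(remaining), "!" + fragment)
    )


if __name__ == "__main__":
    print(to_hashbang_url("http://example.org/me?_escaped_fragment_=l%3Dlibrary"))
```

A server-side filter along these lines would presumably only call the headless browser when the function returns a URL, and pass every other request straight through.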
