Re: [oae-dev] Making OAE indexable by Google

Simon Gaeremynck Fri, 03 Aug 2012 07:49:15 -0700

I might've been a bit too hasty with my conclusion. I was under the impression 
that the preview processor only generated PDFs of a page but it seems to do a 
whole lot more. The only contribution that I could see PhantomJS making 
(besides Google Indexing) is taking a screenshot of an external page.


I've deployed a patch to qa21-us and have been testing what the page looks like 
when the Google Bot (not the Google spider/crawler!) [1] fetches the page, and 
the first results are promising. I've submitted qa21-us for a full crawl and 
will report back later on how well it handles everything.
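
For anyone who wants to reproduce such a fetch by hand, here is a minimal 
sketch of the URL mapping defined by Google's AJAX crawling scheme: the crawler 
rewrites a "#!" URL into a request carrying an _escaped_fragment_ query 
parameter, and the server maps it back before rendering. The function names and 
the example.com URLs are made up for illustration and are not taken from the 
KERN-3084 patch.

```python
from urllib.parse import unquote

MARKER = "_escaped_fragment_="


def escape_fragment(fragment: str) -> str:
    """Escape the bytes the scheme says must be escaped:
    %00..20, %23 (#), %25 (%), %26 (&), %2B (+) and %7F..FF."""
    out = []
    for byte in fragment.encode("utf-8"):
        if byte <= 0x20 or byte >= 0x7F or byte in b"#%&+":
            out.append("%{:02X}".format(byte))
        else:
            out.append(chr(byte))
    return "".join(out)


def to_crawler_url(pretty_url: str) -> str:
    """Map a hashbang URL to the URL a crawler would request."""
    base, sep, fragment = pretty_url.partition("#!")
    if not sep:
        return pretty_url  # no hashbang, nothing to rewrite
    joiner = "&" if "?" in base else "?"
    return base + joiner + MARKER + escape_fragment(fragment)


def to_pretty_url(crawler_url: str) -> str:
    """Server side: recover the hashbang URL to hand to a headless browser."""
    index = crawler_url.find(MARKER)
    if index == -1:
        return crawler_url  # not a crawler request
    base = crawler_url[:index].rstrip("?&")
    fragment = unquote(crawler_url[index + len(MARKER):])
    return base + "#!" + fragment


if __name__ == "__main__":
    url = "http://example.com/me#!l=library&q=a b"  # hypothetical URL
    print(to_crawler_url(url))
    print(to_pretty_url(to_crawler_url(url)))
```

A server-side filter would only need the second direction: when the parameter 
is present, render the recovered URL in the headless browser and return the 
resulting HTML; otherwise pass the request through untouched.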

Simon

[1] https://support.google.com/webmasters/bin/answer.py?hl=en&answer=158587

On 2 Aug 2012, at 23:20, Nate Angell <[email protected]> wrote:

> Thanks Simon!
> 
> Given the possible connections to preview processing, including generating 
> previews of sakai docs, we should make sure this conversation is hooked in to 
> the one on the port of the preview processor to java...there might be some 
> overlap.
> 
> = nate
> 
> On Thu, Aug 2, 2012 at 3:05 PM, Simon Gaeremynck <[email protected]> 
> wrote:
> Hey,
> 
> @Nate
> From doing a quick Google search it looks like Bing and Yahoo support the 
> Google "standard" regarding hashbangs [1].
> AFAICT that's anecdotal evidence though, so I'll need to check whether 
> that's actually the case.
> 
> As Nico mentioned, Google's Cache should probably be fine as the Filter 
> doesn't change the HTML in any way.
> 
> 
> @Christian
> From the tools I've tried it seems to be one of the better ones. They are 
> looking into improving the PDF generator which could possibly be an option to 
> replace the preview processor.
> I was under the impression that BSD would be compatible with the Sakai 
> License, but it might not be. IANAL, so somebody with a deeper 
> understanding of licensing should probably check.
> 
> Regards,
> 
> Simon
> 
> 
> 
> [1] 
> http://searchengineland.com/bing-now-supports-googles-crawlable-ajax-standard-84149
> 
> On 2 Aug 2012, at 21:50, Nicolaas Matthijs 
> <[email protected]> wrote:
> 
>> Hi Nate,
>> 
>>> Would the solution you propose serve other search indexers as well, or just 
>>> Google?
>> 
>> This is a good question and hopefully Simon can jump in here. I believe the 
>> hashbang approach will only allow us to be indexed by Google, so perhaps 
>> it's worth exploring whether it makes sense to inspect the User-Agent header 
>> and detect anything that's not a proper browser.
>> 
>>> I was also wondering what the effect of the solution you propose would be 
>>> on the cached versions of pages Google stores that are available via Google 
>>> search, eg:
>>> http://webcache.googleusercontent.com/search?q=cache:ELwAKCibsysJ:www.sakaiproject.org/+&cd=1&hl=en&ct=clnk&gl=us
>> 
>> I think this should be fine. We will return the HTML exactly like it would 
>> be rendered in a browser, so that will be shown as the cached version as 
>> well.
>> 
>> Hope that helps,
>> Nicolaas
>> 
>> 
>> 
>>> On Thu, Aug 2, 2012 at 11:03 AM, Simon Gaeremynck <[email protected]> 
>>> wrote:
>>> Hi all,
>>> 
>>> I've been working on KERN-3084 [1] which tries to add support for Google's 
>>> AJAX crawler [2].
>>> When Google notices you're using AJAX/JavaScript to display content on your 
>>> page it sends a request to the server asking for a completely rendered 
>>> page. The idea is that we then run the page through a headless browser and 
>>> send that response back to Google.
>>> 
>>> I've created an implementation [3] [4] that does this but I'd like some 
>>> feedback before I send a PR.
>>> This commit would, much like the preview processor, bring in yet another 
>>> dependency. I'm using PhantomJS [5] as it fires up a headless WebKit browser 
>>> and exposes a nice little Node.js API that you can (ab)use.
>>> I tried using the same toolset as the preview processor (wkhtmltopdf) but 
>>> that just seems to generate PDFs and doesn't allow access to the generated 
>>> DOM.
>>> (PhantomJS supports PDF creation, but it's nowhere near as good as 
>>> wkhtmltopdf.)
>>> 
>>> 
>>> What's the feeling about this? Does anyone have a recommendation for a 
>>> better tool/approach?
>>> 
>>> Regards,
>>> 
>>> Simon
>>> 
>>> 
>>> 
>>> [1] https://jira.sakaiproject.org/browse/KERN-3084
>>> [2] 
>>> https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
>>> [3] 
>>> https://github.com/simong/nakamura/commit/83212d6fe814ee32be7dd3d9cd771c40dff6f69f
>>> [4] 
>>> https://confluence.sakaiproject.org/display/KERNDOC/KERN-3084+Making+OAE+indexable+by+Google
>>> [5] http://phantomjs.org/
>>> 
>>> _______________________________________________
>>> oae-dev mailing list
>>> [email protected]
>>> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
>> 
> 
> 

