I took a quick look just now, though it's not really documented yet (the code 
is in the process of being separated out of Chrome).

But it looks like they store pre-calculated compression models for each 
language, and then figure out which model works best on the text being 
analyzed (a good fit implies the text has bytes with a probabilistic 
distribution/sequencing similar to that language's).
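
To make that idea concrete, here is a minimal Python sketch of model-based 
detection. It is not CLD's code and makes no claim about how CLD actually 
builds or stores its models. It assumes a byte-bigram model with add-one 
smoothing as a stand-in for a "compression model", and the names (train, 
cost_in_bits, detect, SAMPLES) and the tiny training sentences are made up 
for illustration. The model under which the text would need the fewest bits 
to encode is taken as the best match.

```python
import math
from collections import Counter

# Tiny, made-up training samples (accents omitted). A real detector would
# ship models pre-calculated from far larger corpora.
SAMPLES = {
    "en": ("the quick brown fox jumps over the lazy dog and then the dog "
           "sleeps in the warm sun while the fox runs through the forest"),
    "fr": ("le renard brun rapide saute par dessus le chien paresseux et "
           "puis le chien dort au soleil pendant que le renard court dans "
           "la foret"),
}

ALPHABET_SIZE = 256  # number of possible byte values


def train(sample):
    """Count byte bigrams and the contexts (first bytes) they start with."""
    data = sample.encode("utf-8")
    pairs = Counter(zip(data, data[1:]))
    contexts = Counter(data[:-1])
    return pairs, contexts


def cost_in_bits(model, text):
    """Estimate how many bits the model needs to encode the text."""
    pairs, contexts = model
    data = text.encode("utf-8")
    bits = 0.0
    for prev, cur in zip(data, data[1:]):
        # Add-one smoothing so unseen bigrams get a small, non-zero probability.
        p = (pairs[(prev, cur)] + 1) / (contexts[prev] + ALPHABET_SIZE)
        bits -= math.log2(p)
    return bits


def detect(models, text):
    """Return the language whose model encodes the text most cheaply."""
    return min(models, key=lambda lang: cost_in_bits(models[lang], text))


MODELS = {lang: train(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for text in ("the dog runs over the fox", "le chien dort dans la foret"):
        print(text, "->", detect(MODELS, text))
```

With samples this small the result is only meaningful for text that closely 
resembles the training sentences. The point is the selection step: score the 
text under every stored model and keep the cheapest one.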

-- Ken

On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:

> Hi,
> 
> I just found this blog post from Mike McCandless about Google's Compact
> Language Detection code used in Chrome:
> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
> 
> There are probably some interesting things to explore in the Google code in
> order to improve Tika's language detection.
> Has anyone already taken a look at the Google CLD code?
> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
> 
> Best regards
> 
> Jérôme
> 
> -- 
> @jcharron
> http://motre.ch/
> http://jcharron.posterous.com/
> http://www.shopreflex.fr/
> http://www.staragora.com/

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


