I took a quick look just now. It's not really documented yet, and it's still in the process of being separated out from inside Chrome.
But it looks like they store pre-calculated compression models for languages, and then figure out which model works best on the text being analyzed (which implies the text has bytes with a similar probabilistic distribution/sequencing).

-- Ken

On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:

> Hi,
>
> I just found this blog post from Mike McCandless about Google's Compact
> Language Detection code used in Chrome:
> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
>
> There are probably some interesting things to explore in the Google code in
> order to improve Tika's Language Detection.
> Has anyone already taken a look at the Google CLD code?
> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
>
> Best regards
>
> Jérôme
>
> --
> @jcharron
> http://motre.ch/
> http://jcharron.posterous.com/
> http://www.shopreflex.fr/
> http://www.staragora.com/
>
> <http://feeds.feedburner.com/~r/Bligblagblog/~6/1>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
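
To make the "which model works best" idea above a bit more concrete, here is a minimal sketch in Python. It is not CLD's code and makes no claim about how CLD is actually implemented. It assumes a character-bigram model per language, and scores text by its average cost in bits under each model, so the model that would "compress" the text best wins. The names (train_model, bits_per_gram, detect) and the two tiny training sentences are made up for illustration.

```python
import math
from collections import Counter


def train_model(sample: str, n: int = 2) -> dict:
    """Count character n-grams in a training sample (hypothetical helper)."""
    grams = Counter(sample[i:i + n] for i in range(len(sample) - n + 1))
    return {"counts": grams, "total": sum(grams.values()), "n": n}


def bits_per_gram(model: dict, text: str) -> float:
    """Average cost in bits of encoding each n-gram of `text` with `model`.

    Uses add-one smoothing with a single extra bucket for unseen n-grams,
    so n-grams the model never saw are expensive but not infinitely so.
    """
    n = model["n"]
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return math.inf
    vocab = len(model["counts"]) + 1
    denom = model["total"] + vocab
    cost = sum(-math.log2((model["counts"].get(g, 0) + 1) / denom) for g in grams)
    return cost / len(grams)


def detect(text: str, models: dict) -> str:
    """Return the language whose model encodes `text` most cheaply."""
    return min(models, key=lambda lang: bits_per_gram(models[lang], text))


# Toy training data, far too small for real use; assumed for illustration only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the cat sat on the mat with the other animals",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis "
          "le chat est assis sur le tapis avec les autres animaux",
}
MODELS = {lang: train_model(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for phrase in ("the dog and the cat", "le chat et le chien"):
        print(phrase, "->", detect(phrase, MODELS))
```

A real detector would ship models trained on large corpora and would likely work on bytes or longer n-grams; the point here is only the selection step, picking the model under which the text is cheapest to encode.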