I've only scratched the surface in figuring out how CLD
works... excising the code and exposing a Python wrapper is much
easier than actually understanding it!

It has some neat features, like passing in three possible "hints":

  * domain extension (fr boosts French)

  * declared encoding

  * declared language

It uses these hints to set pre-computed priors for the top 3 languages.
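
To make the idea of hints concrete, here's a rough Python sketch of how
they could nudge per-language scores. This is not CLD's code or its API:
the tables, the boost values and the function names (apply_hints,
best_language) are all made up for illustration, and the sketch ignores
the "top 3" detail above.

```python
# Hypothetical hint tables: log-space boosts per language. These values are
# made up for illustration and are not CLD's real pre-computed priors.
TLD_BOOSTS = {"fr": {"fr": 1.0}, "de": {"de": 1.0}, "jp": {"ja": 1.0}}
ENCODING_BOOSTS = {"shift_jis": {"ja": 2.0}, "koi8-r": {"ru": 2.0}}
DECLARED_LANG_BOOST = 1.5


def apply_hints(scores, tld=None, encoding=None, declared_lang=None):
    """Return a copy of per-language log scores with hint boosts added."""
    adjusted = dict(scores)
    boosts = []
    if tld is not None:
        boosts.extend(TLD_BOOSTS.get(tld.lower(), {}).items())
    if encoding is not None:
        boosts.extend(ENCODING_BOOSTS.get(encoding.lower(), {}).items())
    if declared_lang is not None:
        boosts.append((declared_lang.lower(), DECLARED_LANG_BOOST))
    for lang, boost in boosts:
        # Only nudge languages the detector already considers candidates.
        if lang in adjusted:
            adjusted[lang] += boost
    return adjusted


def best_language(scores, **hints):
    """Pick the highest-scoring language after applying any hints."""
    adjusted = apply_hints(scores, **hints)
    return max(adjusted, key=adjusted.get)
```

With scores like {"en": -9.5, "fr": -10.0}, passing tld="fr" is enough to
tip a close call toward French, while an unrelated hint leaves the
ranking alone.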

It can optionally "abstain" from guessing when it isn't confident about
a match.  It also returns an overall "reliable" bool, which is true if
the match is high confidence (like Tika's isReasonablyCertain, though
that's per-match).
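
Here's a minimal sketch of what "abstain" plus a "reliable" flag could
look like. I'm assuming a simple margin test between the top two
candidates; I don't know that CLD decides it this way, and
pick_language, min_margin and allow_abstain are hypothetical names.

```python
def pick_language(scores, min_margin=0.2, allow_abstain=True):
    """Return (language, reliable) from a dict of language -> probability.

    The margin rule is an assumption for illustration, not CLD's logic.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_lang, top_score = ranked[0]
    # With a single candidate there is no runner-up to compare against.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    reliable = (top_score - runner_up) >= min_margin
    if allow_abstain and not reliable:
        # Abstain: report no language rather than a low-confidence guess.
        return None, False
    return top_lang, reliable
```

A caller that always wants an answer can pass allow_abstain=False and
check the flag itself.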

But you can't [easily] limit up front the set of languages to test,
like you can with Tika (I think?  You can just call .addProfile() for
each language you want?  Hmm, though loading a LanguageProfile from a
.ngp file looks like it's private inside LanguageIdentifier).

I'm trying to test Tika vs CLD vs the Java language-detection library
(http://code.google.com/p/language-detection)... hoping to finish that
soon and do a follow-on blog post.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 24, 2011 at 9:45 AM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:
> I took a quick look just now, though it's not really documented yet - in the 
> process of being separated from inside of Chrome.
>
> But it looks like they store pre-calculated compression models for languages,
> and then figure out which model works best on the text being analyzed (which 
> implies it has bytes with similar probabilistic distribution/sequencing).
>
> -- Ken
>
> On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:
>
>> Hi,
>>
>> I just found this blog post from Mike McCandless about Google's Compact
>> Language Detection code used in Chrome:
>> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
>>
>> There are probably some interesting things to explore in the Google code in
>> order to improve Tika's Language Detection.
>> Has anyone already taken a look at the Google CLD code?
>> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
>>
>> Best regards
>>
>> Jérôme
>>
>> --
>> @jcharron
>> http://motre.ch/
>> http://jcharron.posterous.com/
>> http://www.shopreflex.fr/
>> http://www.staragora.com/
>>
>> <http://feeds.feedburner.com/~r/Bligblagblog/~6/1>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>
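
PS: Ken's description in the quoted text (pre-calculated compression
models, pick the one that works best on the text) can be illustrated
with a toy sketch. This is generic compression-based identification
using zlib preset dictionaries, not a claim about how CLD really works;
the tiny SAMPLES "models" and the function names are invented for the
example.

```python
import zlib

# Tiny made-up "models": a real system would use large per-language corpora.
SAMPLES = {
    "en": b"the quick brown fox jumps over the lazy dog and then runs away",
    "fr": b"le renard brun rapide saute par dessus le chien paresseux et puis il part",
}


def compressed_size(data, model):
    """Size of data when deflated with the model as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=model)
    return len(compressor.compress(data) + compressor.flush())


def guess_language(text):
    """Return the language whose model compresses the text the smallest."""
    data = text.encode("utf-8")
    return min(SAMPLES, key=lambda lang: compressed_size(data, SAMPLES[lang]))
```

The model that shares the most byte sequences with the input yields the
smallest output, which is the "similar probabilistic distribution" idea
in miniature.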
