Hi Mike,

Sounds like a great idea - see the recent comment thread on 
https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.

And there's also https://issues.apache.org/jira/browse/TIKA-539

Also, what will you be using to test language detection? Wikipedia pages?
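
For what it's worth, here is a rough sketch of how such a comparison could be scored, whatever the corpus ends up being. Everything in it (score_detectors, toy_detector, the sample sentences) is made up for illustration and is not part of Tika, CLD, or language-detection. It only assumes that each real detector can be wrapped as a function that takes text and returns a language code.

```python
from collections import defaultdict


def score_detectors(detectors, samples):
    """Return {detector_name: accuracy} over (text, expected_lang) samples.

    detectors: dict mapping a name to a callable text -> language code.
    samples:   list of (text, expected_language_code) pairs.
    """
    if not samples:
        raise ValueError("need at least one labeled sample")
    correct = defaultdict(int)
    for text, expected in samples:
        for name, detect in detectors.items():
            if detect(text) == expected:
                correct[name] += 1
    total = len(samples)
    return {name: correct[name] / total for name in detectors}


def toy_detector(text):
    """Hypothetical stand-in: guesses French if the word 'le' appears."""
    return "fr" if " le " in f" {text.lower()} " else "en"


# Hand-written toy samples; a real test would use a labeled corpus.
samples = [
    ("le chat dort", "fr"),
    ("the cat sleeps", "en"),
    ("bonjour tout le monde", "fr"),
]

results = score_detectors(
    {"toy": toy_detector, "always_en": lambda text: "en"},
    samples,
)
for name, accuracy in sorted(results.items()):
    print(f"{name}: {accuracy:.2f}")
```

Keeping the detectors behind one callable interface means the same labeled samples can be fed to each library, so the accuracy numbers are directly comparable.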

-- Ken

On Oct 24, 2011, at 7:29pm, Michael McCandless wrote:

> I've only scratched the surface in figuring out how CLD
> works... excising the code and exposing a Python wrapper is much
> easier than actually understanding it!
> 
> It has some neat features, like passing in three possible "hints":
> 
>  * domain extension (e.g. .fr boosts French)
> 
>  * declared encoding
> 
>  * declared language
> 
> It uses these hints to set pre-computed priors for the top 3 languages.
> 
> It can optionally "abstain" from guessing if it thinks it's not very
> confident for certain matches.  It has an overall "reliable" bool that
> comes back, which is true if the match is high confidence (like Tika's
> isReasonablyCertain, though that's per-match).
> 
> But, you can't [easily] limit up front the set of languages to test
> like you can with Tika (I think?  You can just .addProfile() for each
> language you want?  Hmm though loading a LanguageProfile from a .ngp
> file looks like it's private inside LanguageIdentifier).
> 
> I'm trying to test Tika vs CLD vs the Java language-detection library
> (http://code.google.com/p/language-detection)... hoping to finish that
> soon and do a follow-on blog post.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Mon, Oct 24, 2011 at 9:45 AM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>> I took a quick look just now, though it's not really documented yet - in the 
>> process of being separated from inside of Chrome.
>> 
>> But it looks like they store pre-calculated compression models for languages, 
>> and then figure out which model works best on the text being analyzed (which 
>> implies the text has bytes with a similar probabilistic distribution/sequencing).
>> 
>> -- Ken
>> 
>> On Oct 24, 2011, at 3:18pm, Jérôme Charron wrote:
>> 
>>> Hi,
>>> 
>>> I just found this blog post from Mike McCandless about Google's Compact
>>> Language Detection code used in Chrome:
>>> http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
>>> 
>>> There are probably some interesting things to explore in the Google Code in
>>> order to improve Tika's Language Detection.
>>> Has anyone already taken a look at the Google CLD code?
>>> http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
>>> 
>>> Best regards
>>> 
>>> Jérôme
>>> 
>>> --
>>> @jcharron
>>> http://motre.ch/
>>> http://jcharron.posterous.com/
>>> http://www.shopreflex.fr/
>>> http://www.staragora.com/
>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr



