OK I posted the 3rd post about CLD, this time testing perf by
comparing to Tika and language-detection (Google Code project):

    
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Net/net all three do very well (>= 97% accuracy); I had to remove 4
languages from consideration because we don't support them.

Tika seems to have a lot of trouble with Spanish (confuses w/
Galician) and Danish (confuses with Dutch).

Also, Tika's performance is substantially slower than the other two... not
sure what's up.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>
>> Sounds like a great idea - see the recent comment thread on 
>> https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>>
>> And there's also https://issues.apache.org/jira/browse/TIKA-539
>
> Those do look related (if you swap charset in for language)!
>
> It's tricky to know just how much to "trust" what the server
> (Content-Type HTTP header) and content (http-equiv meta tag) says,
> though I do like CLD's approach: they never fully "trust" what was
> declared but rather use the declaration as a hint to boost language
> priors.
>
> And then to figure out what priors to assign for each hint they have
> these tables trained from a large content set (10% of Base).
>
> If we have access to a biggish crawl we could presumably do something
> similar, ie record how often the hint is wrong and translate that into
> appropriate prior boosts, ie make it a hint instead of fully trusting
> it.
>
> Does anyone know how ICU translates the encoding "hint" into priors
> for each encoding?
>
>> Also, what will you be using to test language detection? Wikipedia pages?
>
> I'm using the corpus from here:
>
>    
> http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
>
> It's a random subset of europarl (1000 strings from each of 21 langs).
>
> Wikipedia would be great too!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

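To make the "hint, not trust" idea from the quoted message concrete, here is
a rough sketch of how a crawl could be turned into prior boosts. This is not
CLD's actual implementation and not anything in Tika or ICU; the function
names (train_hint_table, detect), the add-one smoothing, and the toy numbers
are all assumptions made up for illustration. It assumes we have (declared
language, true language) pairs from a labeled crawl, plus per-language
likelihoods from some content-based detector.

```python
from collections import Counter, defaultdict


def train_hint_table(samples, smoothing=1.0):
    """Estimate P(true language | declared language) from a labeled crawl.

    samples: iterable of (declared_lang, true_lang) pairs (hypothetical data).
    smoothing: additive smoothing so that no language gets a zero prior.
    """
    counts = defaultdict(Counter)
    langs = set()
    for declared, true in samples:
        counts[declared][true] += 1
        langs.add(true)

    table = {}
    for declared, seen in counts.items():
        total = sum(seen.values()) + smoothing * len(langs)
        table[declared] = {
            lang: (seen[lang] + smoothing) / total for lang in langs
        }
    return table


def detect(likelihoods, hint=None, table=None):
    """Combine content-based likelihoods with a prior derived from the hint.

    likelihoods: {lang: P(text | lang)} from any content-based detector.
    hint: the declared language (e.g. from a header or meta tag), or None.
    Returns normalized posterior probabilities per language.
    """
    if hint is not None and table is not None and hint in table:
        prior = table[hint]
    else:
        # No usable hint: fall back to a uniform prior.
        prior = {lang: 1.0 / len(likelihoods) for lang in likelihoods}

    # Languages never seen in training get the smallest known prior.
    floor = min(prior.values())
    scores = {
        lang: p * prior.get(lang, floor) for lang, p in likelihoods.items()
    }
    total = sum(scores.values())
    return {lang: score / total for lang, score in scores.items()}


if __name__ == "__main__":
    # Toy crawl: pages declared "es" were really Galician 2 times out of 10.
    crawl = [("es", "es")] * 8 + [("es", "gl")] * 2 + [("gl", "gl")]
    table = train_hint_table(crawl)

    # Weak content evidence: the hint tips the decision towards "es".
    print(detect({"es": 0.4, "gl": 0.6}, hint="es", table=table))

    # Strong content evidence: the hint is overridden and "gl" still wins.
    print(detect({"es": 0.05, "gl": 0.95}, hint="es", table=table))
```

The point of the design is that the declaration only scales the scores: a
hint that is usually right moves a borderline decision, but it cannot
override strong evidence from the content itself.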