On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:

> Sounds like a great idea - see the recent comment thread on 
> https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>
> And there's also https://issues.apache.org/jira/browse/TIKA-539

Those do look related (if you swap charset in for language)!

It's tricky to know just how much to "trust" what the server
(Content-Type HTTP header) and the content (http-equiv meta tag) say,
though I do like CLD's approach: they never fully "trust" what was
declared but rather use the declaration as a hint to boost language
priors.
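
To make the "hint as a prior" idea concrete, here is a minimal sketch in
Python. It is not CLD's actual code: the function name apply_hint, the
language codes and all the numbers are made up for illustration. It only
shows the general Bayesian shape, where the detector's content-based
scores are multiplied by a prior derived from the declaration and then
renormalized.

```python
def apply_hint(likelihoods, hint_prior):
    """Combine content-based scores with a prior derived from a declared hint.

    likelihoods: dict mapping language -> P(content | language) from a detector
    hint_prior:  dict mapping language -> prior weight given the declaration
    Returns a dict of posterior probabilities that sums to 1.
    """
    unnormalized = {
        lang: score * hint_prior.get(lang, 0.0)
        for lang, score in likelihoods.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        # The prior rules out everything the detector saw: ignore the hint.
        norm = sum(likelihoods.values())
        return {lang: score / norm for lang, score in likelihoods.items()}
    return {lang: value / total for lang, value in unnormalized.items()}


# Hypothetical prior for a page that declares "en": mostly, but not fully, trusted.
declared_en = {"en": 0.9, "de": 0.1}

# When the content evidence is close, the hint decides the outcome.
close_call = apply_hint({"en": 0.4, "de": 0.6}, declared_en)

# When the content evidence is strong, it overrides the declaration.
strong_evidence = apply_hint({"en": 0.01, "de": 0.99}, declared_en)
```

Because the prior never puts zero weight on the undeclared languages,
sufficiently strong content evidence can still win against a wrong
declaration.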

To decide what prior to assign for each hint, they use tables trained
on a large content set (10% of Base).

If we have access to a biggish crawl we could presumably do something
similar, i.e. record how often each hint is wrong and translate that
into an appropriate prior boost, so the declaration acts as a hint
instead of being fully trusted.
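
A minimal sketch of that bookkeeping, in Python, assuming the crawl can
be reduced to (declared hint, true value) pairs. The function name
hint_priors, the add-one smoothing and the toy data are all assumptions
for illustration, not how CLD builds its tables.

```python
from collections import Counter, defaultdict


def hint_priors(observations, languages, smoothing=1.0):
    """Estimate P(true language | declared hint) from (hint, truth) pairs.

    Additive smoothing keeps every language possible for every hint,
    so a declaration is never fully trusted.
    """
    counts = defaultdict(Counter)
    for hint, truth in observations:
        counts[hint][truth] += 1

    priors = {}
    for hint, seen in counts.items():
        total = sum(seen.values()) + smoothing * len(languages)
        priors[hint] = {
            lang: (seen[lang] + smoothing) / total for lang in languages
        }
    return priors


# Toy data (made up): of 10 pages declaring "en", 8 really are English.
crawl = [("en", "en")] * 8 + [("en", "de")] + [("en", "fr")]
priors = hint_priors(crawl, ["en", "de", "fr"])
```

The resulting table can then be looked up by the declared value to get
the prior for each candidate, with rarely correct hints producing only
a weak boost.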

Does anyone know how ICU translates the encoding "hint" into priors
for each encoding?

> Also, what will you be using to test language detection? Wikipedia pages?

I'm using the corpus from here:

    
http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/

It's a random subset of europarl (1000 strings from each of 21 langs).

Wikipedia would be great too!

Mike McCandless

http://blog.mikemccandless.com
