Re: Google's Compact Language Detector

Jérôme Charron Tue, 25 Oct 2011 09:54:10 -0700

Thanks, Mike, for sharing these tests.
There is clearly a performance issue with Tika's run time.
As you noted, it will be interesting to see whether accuracy can be
increased by mixing the language profiles of several libraries.
But I'm not sure whether accuracy depends only on the language profiles
and not on the algorithm too...
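
For what it's worth, the "hint instead of fully trusting it" idea from the
quoted message below could be sketched roughly like this in Python. It is
only an illustration: the function names, the add-one smoothing, the floor
value and the toy counts are all assumptions made up for the example, not
how CLD, Tika or ICU actually do it.

```python
from collections import Counter, defaultdict


def estimate_hint_priors(observations, smoothing=1.0):
    """Estimate P(true language | declared hint) from a labelled crawl.

    observations: iterable of (declared_lang, true_lang) pairs (hypothetical
    input format). Add-k smoothing keeps every known language possible.
    """
    counts = defaultdict(Counter)
    langs = set()
    for declared, true in observations:
        counts[declared][true] += 1
        langs.update((declared, true))

    priors = {}
    for declared, seen in counts.items():
        total = sum(seen.values()) + smoothing * len(langs)
        priors[declared] = {
            lang: (seen[lang] + smoothing) / total for lang in langs
        }
    return priors


def apply_hint(scores, hint, priors, floor=1e-6):
    """Reweight detector scores by the prior for this hint, then normalize.

    scores: {lang: positive score from the content-based detector}.
    An unknown or missing hint leaves the scores unchanged (uniform prior).
    """
    prior = priors.get(hint)
    if prior is None:
        weighted = dict(scores)
    else:
        weighted = {
            lang: score * prior.get(lang, floor)
            for lang, score in scores.items()
        }
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}


# Toy crawl: pages declared "es" are really Galician 10% of the time.
crawl = [("es", "es")] * 90 + [("es", "gl")] * 10 + [("gl", "gl")] * 50
priors = estimate_hint_priors(crawl)

# A borderline es/gl call is tipped by the hint...
borderline = apply_hint({"es": 0.45, "gl": 0.55}, "es", priors)
# ...but strong content evidence still wins over the declaration.
confident = apply_hint({"es": 0.01, "gl": 0.99}, "es", priors)
```

The point is just that the declared language multiplies into the score
rather than replacing it, so a wrong declaration can still be overruled
when the content evidence is strong enough.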



On Tue, Oct 25, 2011 at 18:12, Michael McCandless <luc...@mikemccandless.com
> wrote:

> OK I posted the 3rd post about CLD, this time testing perf by
> comparing to Tika and language-detection (Google Code project):
>
>
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>
> Net/net all three do very well (>= 97% accuracy); I had to remove 4
> languages from consideration because we don't support them.
>
> Tika seems to have a lot of trouble with Spanish (confuses w/
> Galician) and Danish (confuses with Dutch).
>
> Also, Tika's performance is substantially slower than the other two... not
> sure what's up.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
> > On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
> > <kkrugler_li...@transpac.com> wrote:
> >
> >> Sounds like a great idea - see the recent comment thread on
> https://issues.apache.org/jira/browse/TIKA-431 for some related
> discussions.
> >>
> >> And there's also https://issues.apache.org/jira/browse/TIKA-539
> >
> > Those do look related (if you swap charset in for language)!
> >
> > It's tricky to know just how much to "trust" what the server
> > (Content-Type HTTP header) and content (http-equiv meta tag) says,
> > though I do like CLD's approach: they never fully "trust" what was
> > declared but rather use the declaration as a hint to boost language
> > priors.
> >
> > And then to figure out what priors to assign for each hint they have
> > these tables trained from a large content set (10% of Base).
> >
> > If we have access to a biggish crawl we could presumably do something
> > similar, ie record how often the hint is wrong and translate that into
> > appropriate prior boosts, ie make it a hint instead of fully trusting
> > it.
> >
> > Does anyone know how ICU translates the encoding "hint" into priors
> > for each encoding?
> >
> >> Also, what will you be using to test language detection? Wikipedia
> pages?
> >
> > I'm using the corpus from here:
> >
> >
> http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
> >
> > It's a random subset of europarl (1000 strings from each of 21 langs).
> >
> > Wikipedia would be great too!
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
>



-- 
--------
@jcharron <http://www.twitter.com/jcharron>
http://motre.ch/
http://jcharron.posterous.com/
http://www.shopreflex.fr/
http://www.staragora.com/
