Re: AW: N-gram layer and language guessing

karl wettin Tue, 03 Feb 2004 04:58:38 -0800

On Tue, 03 Feb 2004 12:47:06 +0100
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Karsten Konrad wrote:
> > The guesser uses only tri- and quad-grams and is based on
> > a sophisticated machine learning algorithm instead of a raw
> > TF/IDF-weighting. The upside of this is the "confidence" 
> > value for estimating how much you can trust the 
> > classification. The downside is the model size: 5MB for 15 
> > languages, which comes mostly from using quad-grams - our 
> > machine learners don't do feature selection very well.
> 
> Impressive. For comparison, my language models are roughly 3kB per 
> language, and the guesser works with nearly perfect accuracy for texts
> longer than 10 words. Below that - it depends.. :-)

Impressive indeed. However, it is quite important that one can detect
the language of a query, and a query is rarely as long as 10 words. It
is the query whose language I want to detect when stemming.
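
To make the short-query problem concrete, here is a minimal sketch of a
character n-gram guesser. It is not Karsten's machine-learned model and
not Andrzej's implementation: it swaps in a plain rank-order
("out-of-place") comparison of tri- and quad-gram profiles. The function
names, the two toy training sentences and the top_k cutoff are all
assumptions made up for illustration.

```python
from collections import Counter


def ngram_profile(text, n_values=(3, 4), top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in n_values:
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, models):
    """Return (best_language, scores); a lower score means a closer match."""
    doc = ngram_profile(text)
    scores = {lang: out_of_place(doc, prof) for lang, prof in models.items()}
    return min(scores, key=scores.get), scores


# Toy training data (hypothetical); real models need far more text per language.
models = {
    "en": ngram_profile(
        "the quick brown fox jumps over the lazy dog "
        "and the cat sleeps in the house"
    ),
    "de": ngram_profile(
        "der schnelle braune fuchs springt über den faulen hund "
        "und die katze schläft in dem haus"
    ),
}

best, scores = guess_language("the dog and the cat", models)
print(best, scores)
```

A two- or three-word query yields only a handful of n-grams, so the
scores of different languages end up close together. That is where a
confidence value like the one Karsten describes would help: the stemmer
could fall back to a default when the margin between the best and
second-best score is small.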

Karsten, what specifics can you tell us about the algorithms? 

I'm going to take a look at Weka tonight and see if I could
implement something like this for Lucene.

kalle
