On Tue, 03 Feb 2004 12:47:06 +0100 Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Karsten Konrad wrote:
> > The guesser uses only tri- and quad-grams and is based on
> > a sophisticated machine learning algorithm instead of a raw
> > TF/IDF-weighting. The upside of this is the "confidence"
> > value for estimating how much you can trust the
> > classification. The downside is the model size: 5MB for 15
> > languages, which comes mostly from using quad-grams - our
> > machine learners don't do feature selection very well.
>
> Impressive. For comparison, my language models are roughly 3kB per
> language, and the guesser works with nearly perfect accuracy for texts
> longer than 10 words. Below that - it depends.. :-)

Impressive indeed. However, it is quite important that one can detect the
language of a query, and a query is rarely as long as 10 words. And it is the
query whose language I want to detect when stemming.

Karsten, what specifics can you tell us about the algorithms? I'm going to
take a look at Weka tonight and see if I could implement something like
this for Lucene.

kalle
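
To make the short-query problem concrete, here is a minimal sketch of a character n-gram guesser. It is an illustration only, not Karsten's machine-learning guesser and not Andrzej's models: the class name NGramGuesserSketch, the cosine-similarity scoring, and the one-sentence "models" are all assumptions chosen to keep the example small. A two-word query yields only a handful of trigrams, which is why accuracy tends to suffer on short input.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: character n-gram language guessing via cosine similarity. */
class NGramGuesserSketch {

    /** Counts character n-grams of the lower-cased text, padded with one space at each end. */
    static Map<String, Integer> profile(String text, int n) {
        String padded = " " + text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ") + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two n-gram count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Integer other = b.get(e.getKey());
            if (other != null) dot += v * other;
        }
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the best-matching language, or null if the text shares no n-gram with any model. */
    static String guess(String text, Map<String, Map<String, Integer>> models, int n) {
        Map<String, Integer> query = profile(text, n);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> m : models.entrySet()) {
            double score = cosine(query, m.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = m.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy models built from a single sentence each; real models need far more text.
        Map<String, Map<String, Integer>> models = new LinkedHashMap<>();
        models.put("en", profile("the quick brown fox jumps over the lazy dog and the cat", 3));
        models.put("de", profile("der schnelle braune fuchs springt auf den faulen hund und die katze", 3));
        System.out.println(guess("the dog", models, 3));   // en
        System.out.println(guess("der hund", models, 3));  // de
    }
}
```

The similarity score could double as a rough confidence value, though that is only a stand-in for the kind of confidence estimate Karsten describes.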