Karsten Konrad wrote:
Hi,

does anybody here use an n-gram layer for fault-tolerant searching on *larger* texts? I ask because once you use at least quad-grams, you can expect far more distinct n-grams than words to emerge from a text - and the number of distinct tokens indexed seems to be the most important parameter for Lucene's search speed.
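
For illustration, here is a minimal sketch of the kind of character
n-gram extraction such a layer performs (the class and method names
are my own, not Lucene API); note how even a single short word
already yields several quad-gram tokens:

// Sketch: extracting character n-grams from a token for fuzzy indexing.
// Class and method names are illustrative, not part of Lucene's API.
import java.util.ArrayList;
import java.util.List;

public class NGrams {

    /** Returns all character n-grams of length n contained in token. */
    static List<String> extract(String token, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "heter" (5 letters) yields two quad-grams: "hete", "eter".
        System.out.println(extract("heter", 4));
    }
}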


Anyway, XtraMind's n-gram language guesser gives the following best five results on the Swedish examples discussed previously:

"jag heter kalle"

Swedish   100.00 %
Norwegian  17.51 %
Danish     10.02 %
Afrikaans   9.53 %
Dutch       8.79 %

"vad heter du"

Swedish   100.00 %
Dutch      20.97 %
Norwegian  14.68 %
Danish     11.07 %
Afrikaans   9.29 %

The guesser uses only tri- and quad-grams and is based on a sophisticated machine learning algorithm instead of raw TF/IDF weighting. The upside of this is the "confidence" value, which estimates how much you can trust the classification. The downside is the model size: 5 MB for 15 languages, which comes mostly from using quad-grams - our machine learners don't do feature selection very well.
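
For contrast, here is a minimal sketch of the raw frequency-profile
approach mentioned above (all class and method names are illustrative;
this is neither XtraMind's nor any production code). A real guesser
would normalize the scores, e.g. against the best match, to obtain
percentages like those shown above:

// Sketch: tri-gram frequency profiles as a baseline language guesser.
import java.util.HashMap;
import java.util.Map;

public class LanguageGuesser {

    // language -> (tri-gram -> relative frequency)
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();

    /** Builds a tri-gram frequency profile from training text
     *  (assumes the text is at least 3 characters long). */
    void train(String language, String text) {
        Map<String, Double> profile = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 3 <= text.length(); i++) {
            profile.merge(text.substring(i, i + 3), 1.0, Double::sum);
            total++;
        }
        for (Map.Entry<String, Double> e : profile.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        profiles.put(language, profile);
    }

    /** Sums each profile's frequencies over the input's tri-grams;
     *  a higher score means a more likely language. */
    Map<String, Double> guess(String text) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> p : profiles.entrySet()) {
            double score = 0.0;
            for (int i = 0; i + 3 <= text.length(); i++) {
                score += p.getValue().getOrDefault(text.substring(i, i + 3), 0.0);
            }
            scores.put(p.getKey(), score);
        }
        return scores;
    }
}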

Impressive. For comparison, my language models are roughly 3 kB per language, and the guesser works with nearly perfect accuracy on texts longer than 10 words. Below that, it depends... :-)


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

