Hi, does anybody here use an n-gram layer for fault-tolerant searching on *larger* texts? I ask because you can expect far more n-grams than words to emerge from a text once you use at least quad-grams - and the number of distinct tokens indexed seems to be the most important parameter for Lucene's search speed.
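To illustrate the point, here is a minimal sketch (my own toy code, not Lucene's tokenizer; the class and method names are made up for this example) comparing the number of distinct word tokens against distinct character quad-grams for a short string:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NgramCount {

    // Extract the set of distinct character n-grams of the given size.
    static Set<String> ngrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String text = "jag heter kalle";
        Set<String> words = new HashSet<>(Arrays.asList(text.split("\\s+")));
        System.out.println("distinct words: " + words.size());          // 3
        System.out.println("distinct quad-grams: " + ngrams(text, 4).size()); // 12
    }
}
```

Even on a three-word string, quad-grams produce four times as many distinct tokens as whole words; on a large corpus the gap between vocabulary size and n-gram-set size grows accordingly.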
Anyway, XtraMind's n-gram language guesser gives the following best-five results on the Swedish examples discussed previously:

"jag heter kalle"
  Swedish 100.00 %, Norwegian 17.51 %, Danish 10.02 %, Afrikaans 9.53 %, Dutch 8.79 %

"vad heter du"
  Swedish 100.00 %, Dutch 20.97 %, Norwegian 14.68 %, Danish 11.07 %, Afrikaans 9.29 %

The guesser uses only tri- and quad-grams and is based on a sophisticated machine learning algorithm instead of raw TF/IDF weighting. The upside of this is the "confidence" value for estimating how much you can trust the classification. The downside is the model size: 5 MB for 15 languages, which comes mostly from using quad-grams - our machine learners don't do feature selection very well.

With kind regards from Saarbrücken

--
Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 3 February 2004 09:27
To: Lucene Developers List
Subject: Re: N-gram layer

karl wettin wrote:
> On Mon, 2 Feb 2004 20:10:57 +0100
> "Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote:
>
>> During the past days, I've developed such a language guesser myself
>> as a basis for a Lucene analyzer. It is based on trigrams. It is
>> already working but not yet in a "publishable" state. If you or others
>> are interested, I can offer the sources.
>
> I use a variable gram size because of the difficulty of detecting the
> language of very small texts such as a query. For instance, with
> bi- to quad-grams the Swedish sentence "Jag heter Karl" ("my name is
> Karl") is classified as Afrikaans. Using uni- to quad-grams does the
> trick.
>
> Also, I add penalties for gram-sized words found in the text but not
> in the classified language. This improved my results even more.
>
> And I've been considering applying Markov chains on the grams where it
> is still hard to guess the language, such as Afrikaans vs. Dutch and
> American vs. British English.
>
> Let me know if you want a copy of my code.
>
> Here is some test output:
> [...]
> As you see, the single-word penalty on uni- to quad-grams does the
> trick on even the smallest of text strings.

Well, perhaps it's also a matter of the quality of the language profiles. In one of my projects I'm using language profiles constructed from 1- to 5-grams, with a total of 300 grams per language profile. I don't do any additional tricks with penalizing the high-frequency words. If I run the above example, I get the following (lower score = closer match):

"jag heter kalle"
<input> - SV: 0.7197875
<input> - DN: 0.745925
<input> - NO: 0.747225
<input> - FI: 0.755475
<input> - NL: 0.7597125
<input> - EN: 0.76746875
<input> - FR: 0.77628125
<input> - GE: 0.7785125
<input> - IT: 0.796675
<input> - PL: 0.7984875
<input> - PT: 0.7995875
<input> - ES: 0.800775
<input> - RU: 0.88500625

However, for the text "vad heter du" ("what's your name") the detection fails... :-)

A question: what was your source for the representative high-frequency words in the various languages? Was it your training corpus or some publication?

--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
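For readers following the thread, here is a toy sketch of the general idea being discussed (my own simplification - not karl's, Jean-Francois's, or Andrzej's actual code, and the corpora, class name, and penalty weight are all invented for illustration): build a character n-gram profile per language from training text, then score an input by counting its grams that appear in each profile, subtracting a penalty for grams the profile has never seen.

```java
import java.util.HashMap;
import java.util.Map;

public class NgramGuesser {

    // Build a frequency map of character n-grams (sizes 1..4) from a corpus.
    static Map<String, Integer> profile(String corpus) {
        Map<String, Integer> freq = new HashMap<>();
        for (int n = 1; n <= 4; n++) {
            for (int i = 0; i + n <= corpus.length(); i++) {
                freq.merge(corpus.substring(i, i + n), 1, Integer::sum);
            }
        }
        return freq;
    }

    // Score a text against a profile: +1 for each gram the profile knows,
    // minus a penalty for each gram it has never seen. Higher = better match.
    static int score(String text, Map<String, Integer> profile, int penalty) {
        int s = 0;
        for (int n = 1; n <= 4; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                s += profile.containsKey(text.substring(i, i + n)) ? 1 : -penalty;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Tiny toy corpora; real profiles are trained on much larger text.
        Map<String, Integer> swedish =
            profile("jag heter anna vad heter du och det är kallt");
        Map<String, Integer> dutch =
            profile("ik heet wat is jouw naam en het is een");

        String query = "jag heter kalle";
        System.out.println("swedish: " + score(query, swedish, 2));
        System.out.println("dutch:   " + score(query, dutch, 2));
    }
}
```

The penalty term is the trick karl describes: unseen grams actively count against a language instead of merely not counting for it, which sharpens the separation on very short inputs like queries. Real systems typically keep only the top few hundred grams per language (as in Andrzej's 300-gram profiles) and use rank- or frequency-based distances rather than this raw overlap count.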