>>
Instead I'll do some experiments with Markov chains on the n-grams. Hopefully this 
will yield quite a distinct difference between languages without wasting too many 
clock ticks.
>>

This approach can work, but it will require many more training examples.
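
To make the idea concrete, here is a minimal sketch of a character-level
Markov chain guesser. It is not the code either of us uses; the function
names, the order of 2, the smoothing constant and the two tiny training
texts are all assumptions made up for illustration. Each language gets a
table of "which character follows these two characters", and a string is
assigned to the language whose table gives it the highest log-likelihood.

```python
import math
from collections import Counter, defaultdict

def train(text, order=2):
    """Count character transitions: `order` preceding chars -> next char."""
    padded = " " * order + text.lower() + " "
    counts = defaultdict(Counter)
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts

def log_likelihood(model, text, order=2, alphabet_size=64):
    """Sum of log P(next char | context) with add-one smoothing.

    alphabet_size is an assumed upper bound on distinct characters.
    """
    padded = " " * order + text.lower() + " "
    total = 0.0
    for i in range(order, len(padded)):
        following = model.get(padded[i - order:i], Counter())
        seen = following[padded[i]]
        total += math.log((seen + 1) / (sum(following.values()) + alphabet_size))
    return total

def guess(models, text, order=2):
    """Return the language whose chain makes `text` most likely."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text, order))

# Toy training data (made up); real models need far more text per language.
models = {
    "swedish": train("jag heter kalle och jag bor i sverige "
                     "vad heter du och var bor du"),
    "english": train("my name is charlie and i live in england "
                     "what is your name and where do you live"),
}

print(guess(models, "var bor kalle"))
print(guess(models, "where is charlie"))
```

Guessing costs one table lookup per character, so it is cheap at query
time. The price is paid in training data: with short texts most contexts
are never seen, and the smoothing term dominates the score.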

If you are only interested in guessing the language of a query, one simple
approach would be to use unstemmed, language-separated indexes. Simply
look the query words up using the Lucene IndexReader; for each language in 
whose index you find unstemmed words of the query, it may be worthwhile to 
stem the query in that language and search over the (stemmed) index of that 
language again.

This requires either redundant indexes (stemmed/unstemmed for each language) 
or a manipulation of the analyzers such that you index both stemmed and 
unstemmed versions of the same word. 
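
The lookup step can be sketched roughly as follows. This is only an
illustration under stated assumptions: plain Python sets stand in for the
unstemmed per-language indexes (in Lucene the lookup would go through the
IndexReader), and the vocabularies and the name candidate_languages are
made up.

```python
from collections import Counter

# Stand-ins for the unstemmed, per-language indexes (made-up vocabularies).
UNSTEMMED_TERMS = {
    "swedish": {"jag", "heter", "vad", "du", "och", "bor"},
    "german": {"ich", "heisse", "wie", "du", "und", "wohne"},
}

def candidate_languages(query):
    """Return languages ordered by how many query words their index contains."""
    words = query.lower().split()
    hits = Counter()
    for lang, terms in UNSTEMMED_TERMS.items():
        matched = sum(1 for word in words if word in terms)
        if matched:
            hits[lang] = matched
    # Languages with no matching word are left out entirely.
    return [lang for lang, _ in hits.most_common()]

print(candidate_languages("vad heter du"))
```

The query would then be stemmed and searched once per returned language,
best candidate first.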

Regards,

Karsten


-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED] 
Sent: Friday, 6 February 2004 07:58
To: Lucene Developers List
Subject: Re: AW: N-gram layer and language guessing


On Tue, 3 Feb 2004 11:39:40 +0100
"Karsten Konrad" <[EMAIL PROTECTED]> wrote:

> 
> Anyway, XtraMind's n-gram language guesser gives the following
> five best results on the Swedish examples discussed previously:
> 
> "jag heter kalle"
> 
> swedish 100,00 %
> norwegian 17,51 %
> danish 10,02 %
> afrikaans 9,53 %
> dutch 8,79 %
> 
> "vad heter du"
> 
> swedish 100,00 %
> dutch 20,97 %
> norwegian 14,68 %
> danish 11,07 %
> afrikaans 9,29 %


I spent all my time working on a better language guesser rather than building the 
stemmer. The results I got from Weka are OK, but because of the amount of computation 
needed to guess the language of even the shortest of strings, it is not possible for 
me to use these algorithms.

Instead I'll do some experiments with Markov chains on the n-grams. Hopefully this 
will yield quite a distinct difference between languages without wasting too many 
clock ticks.

Any thoughts on the subject are welcome.

I'll get back with results.

-- 

kalle



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


