> I like the language identifier very much, but we noticed that it makes
> the indexing process 3 times slower.
> If people index very large segments, this may be a problem.

I noticed that there are two loops in the getSimilarity() method.
But I don't really understand why you use two loops, Sami (in fact, I don't 
understand why you first compare anotherProfile to currentProfile, and then 
compare currentProfile to anotherProfile).

> Do you think it makes sense to use thresholds? For example, do not
> generate a score for the complete profile, but use only the top 10
> ngrams and check whether there is a clear best profile using a threshold.
> If the result isn't clear, use 10 more ngrams, etc.


I implemented very similar code during my PhD, so I have a little 
experience with language identification using ngrams.
I think it really makes sense to use thresholds, because the most relevant 
ngrams are the first ones, and in most cases they could be sufficient to 
identify the language (perhaps we would need to normalize the ngram files 
against each other by removing ngrams that are duplicated in many files?).
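
To make the threshold idea concrete, here is a minimal sketch in Java. It is
not the actual code of the language identifier: the Profile and Result classes,
the guess() method, the MISSING_PENALTY constant, and the rank-difference
("out-of-place") scoring are all assumptions made for illustration. It scores
the document's top ngrams in batches and stops as soon as the best profile
leads the runner-up by at least a given margin.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ThresholdLanguageGuesser {

    // Assumed penalty for an ngram that a profile does not contain at all.
    static final int MISSING_PENALTY = 100;

    /** Hypothetical profile: a language name and its ngrams, most frequent first. */
    static final class Profile {
        final String language;
        final Map<String, Integer> rank = new HashMap<>();

        Profile(String language, String... rankedNgrams) {
            this.language = language;
            for (int i = 0; i < rankedNgrams.length; i++) {
                rank.put(rankedNgrams[i], i);
            }
        }
    }

    /** Guessed language plus how many document ngrams were needed to decide. */
    static final class Result {
        final String language;
        final int ngramsUsed;

        Result(String language, int ngramsUsed) {
            this.language = language;
            this.ngramsUsed = ngramsUsed;
        }
    }

    /**
     * docNgrams must be sorted by decreasing frequency in the document.
     * Scores `step` ngrams at a time and stops once the best profile is
     * ahead of the second best by at least `margin`.
     * Returns null if there is nothing to compare.
     */
    static Result guess(List<String> docNgrams, List<Profile> profiles, int step, int margin) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        if (profiles.isEmpty() || docNgrams.isEmpty()) {
            return null;
        }
        long[] distance = new long[profiles.size()];
        int best = 0;
        int used = 0;
        while (used < docNgrams.size()) {
            int end = Math.min(used + step, docNgrams.size());
            for (int i = used; i < end; i++) {
                for (int p = 0; p < profiles.size(); p++) {
                    Integer r = profiles.get(p).rank.get(docNgrams.get(i));
                    // Rank difference between document and profile (lower is better).
                    distance[p] += (r == null) ? MISSING_PENALTY : Math.abs(r - i);
                }
            }
            used = end;

            // Find the best and second-best profiles so far.
            best = 0;
            long runnerUp = Long.MAX_VALUE;
            for (int p = 1; p < profiles.size(); p++) {
                if (distance[p] < distance[best]) {
                    runnerUp = distance[best];
                    best = p;
                } else if (distance[p] < runnerUp) {
                    runnerUp = distance[p];
                }
            }
            if (runnerUp - distance[best] >= margin) {
                break; // clear winner: no need to look at more ngrams
            }
        }
        return new Result(profiles.get(best).language, used);
    }

    public static void main(String[] args) {
        List<Profile> profiles = Arrays.asList(
                new Profile("en", "th", "he", "in", "er", "an"),
                new Profile("fr", "es", "le", "de", "en", "on"));
        Result r = guess(Arrays.asList("th", "he", "an", "er", "in", "re"), profiles, 2, 10);
        System.out.println(r.language + " after " + r.ngramsUsed + " ngrams");
    }
}
```

The step size (10 in the proposal above) and the margin would have to be tuned
on a real corpus; the toy values in main() are only there to exercise the loop.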

I think we should run some experiments on a reference corpus and do some 
comparative benchmarks.
Sami, if you need some help...


Jerome


-- 
http://motrech.free.fr/
http://frutch.free.fr/
