I notice that there are two loops in the getSimilarity() method.
But I don't really understand why you use two loops, Sami (in fact, I don't understand why you first compare anotherProfile to currentProfile, and then compare currentProfile to anotherProfile?)

This was implemented to make the similarity calculation symmetric, i.e. a.getSimilarity(b) == b.getSimilarity(a), but I guess this is not really a requirement and might slightly slow things down.
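
To illustrate why two passes give a symmetric result, here is a minimal sketch. It is not the actual getSimilarity() code: the class name, the method names and the use of a plain Map of normalized ngram frequencies are all assumptions made for the example. It computes a distance, so lower means more similar.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real profile class.
class SimilaritySketch {

    // One pass: for every ngram in `from`, add the absolute difference to
    // its frequency in `to` (0.0 when `to` does not contain the ngram).
    // Ngrams that exist only in `to` are never visited, so this pass
    // alone depends on the argument order.
    static double oneWay(Map<String, Double> from, Map<String, Double> to) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : from.entrySet()) {
            double other = to.getOrDefault(e.getKey(), 0.0);
            sum += Math.abs(e.getValue() - other);
        }
        return sum;
    }

    // Two passes: adding both directions makes the result independent of
    // the argument order, at the cost of iterating over both profiles.
    static double symmetric(Map<String, Double> a, Map<String, Double> b) {
        return oneWay(a, b) + oneWay(b, a);
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("th", 0.5);
        a.put("he", 0.3);
        Map<String, Double> b = new HashMap<>();
        b.put("th", 0.4);
        b.put("de", 0.2);
        System.out.println(oneWay(a, b) + " vs " + oneWay(b, a));
        System.out.println(symmetric(a, b) == symmetric(b, a));
    }
}
```

Dropping the second pass would roughly halve the work per comparison, but the score would then depend on which profile is the caller.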


I implemented very similar code during my PhD, so I have a little experience with language identification using ngrams.
I think it really makes sense to use thresholds, because the most relevant ngrams are the first ones and could in most cases be sufficient to identify the language (perhaps there will be a need to normalize the ngram files against each other by removing ngrams duplicated in many files?)
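
One possible reading of the threshold idea, as a sketch only: keep just the highest-ranked ngrams of a profile before comparing. The class name, the method name and the count map below are hypothetical, and the cutoff value would have to come out of experiments.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of a rank threshold on an ngram profile.
class ProfileCutoffSketch {

    // Returns the `limit` most frequent ngrams, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topNGrams(Map<String, Integer> counts, int limit) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Removing ngrams that are common to many language files would be a separate filtering step on top of this and is not shown here.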


I think we must run some experiments on a reference corpus and do some comparative benchmarks.
Sami, if you need some help....

help is always appreciated!

I think the most time-consuming part of the language identifier is splitting the text into ngrams, and probably the biggest optimization could be done there.
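
For reference, a naive version of the splitting step could look like the sketch below. This is an assumption about the general technique, not the existing code, and the names are made up. It shows where the cost comes from: every position in the text produces one substring and one map update per ngram length.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of straightforward ngram counting.
class NGramSplitSketch {

    // Counts every ngram of length minLen..maxLen with a sliding window.
    // The work is roughly text.length() * (maxLen - minLen + 1) substring
    // allocations plus as many map updates.
    static Map<String, Integer> countNGrams(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```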

Perhaps a configurable variable could set the maximum text length to be analyzed. A minimum limit could also be defined, because with a small number of ngrams the performance (as in quality) is very poor.
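
A rough sketch of what the two limits could look like. The class, the method names and the parameters are hypothetical, and reasonable values would need to be found by testing.

```java
// Hypothetical sketch of the maximum and minimum limits.
class AnalysisLimitsSketch {

    // Upper limit: analyze at most maxChars characters of the input.
    static String clip(String text, int maxChars) {
        return text.length() > maxChars ? text.substring(0, maxChars) : text;
    }

    // Lower limit: a text of length L yields L - n + 1 ngrams of length n
    // (none if it is shorter than n). Below minNGrams the identification
    // could be skipped or reported as unreliable.
    static boolean hasEnoughNGrams(String text, int n, int minNGrams) {
        int available = Math.max(0, text.length() - n + 1);
        return available >= minNGrams;
    }
}
```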

I'll also do some experiments to see how speed could be improved.

--
 Sami Siren




