Hi Sami, Hi all,
I like the language identifier very much, but we noticed that it slows down the indexing process by a factor of three.
This may be a problem for people who index very large segments.
I have a set of questions:
+ Can you tell me which corpus you used to generate the ngram files?
+ Are there any plans to improve speed by fine tuning the implementation?
+ Why does the implementation use Vector instead of ArrayList?
+ Do you think it makes sense to use thresholds? For example, instead of generating a score for the complete profile, use only the top 10 ngrams and check against a threshold whether there is a clear best profile. If the result isn't clear, use 10 more ngrams, and so on.
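
To make the threshold idea concrete, here is a rough sketch of what I have in mind. It is not taken from the language identifier's code: the Profile class, the per-ngram weights, the batch size and the margin are all assumptions made up for illustration, and a real scoring function would probably look different.

```java
import java.util.List;
import java.util.Map;

// Sketch only: every name here is hypothetical, not the plugin's API.
class ThresholdLanguageGuess {

    /** Hypothetical language profile: a language code plus a weight per ngram. */
    static final class Profile {
        final String language;
        final Map<String, Double> weights;

        Profile(String language, Map<String, Double> weights) {
            this.language = language;
            this.weights = weights;
        }
    }

    /** Best language found, and how many ngrams were scored to find it. */
    static final class Result {
        final String language;
        final int ngramsUsed;

        Result(String language, int ngramsUsed) {
            this.language = language;
            this.ngramsUsed = ngramsUsed;
        }
    }

    /**
     * Scores the document's ngrams in batches (most frequent first) and
     * stops as soon as the best profile leads the runner-up by at least
     * 'margin'. If no batch gives a clear winner, all ngrams are used.
     */
    static Result identify(List<String> topNgrams, List<Profile> profiles,
                           int batchSize, double margin) {
        if (profiles.isEmpty() || topNgrams.isEmpty() || batchSize <= 0) {
            return new Result(null, 0);
        }
        double[] scores = new double[profiles.size()];
        int best = 0;
        int used = 0;
        while (used < topNgrams.size()) {
            int end = Math.min(used + batchSize, topNgrams.size());
            // Add this batch's contribution to every profile's running score.
            for (String ngram : topNgrams.subList(used, end)) {
                for (int p = 0; p < profiles.size(); p++) {
                    scores[p] += profiles.get(p).weights.getOrDefault(ngram, 0.0);
                }
            }
            used = end;
            // Find the best score and the runner-up score.
            best = 0;
            double runnerUp = Double.NEGATIVE_INFINITY;
            for (int p = 1; p < profiles.size(); p++) {
                if (scores[p] > scores[best]) {
                    runnerUp = scores[best];
                    best = p;
                } else if (scores[p] > runnerUp) {
                    runnerUp = scores[p];
                }
            }
            // Clear winner: skip the remaining ngrams.
            if (scores[best] - runnerUp >= margin) {
                break;
            }
        }
        return new Result(profiles.get(best).language, used);
    }
}
```

With a batch size of 10 this would only touch the full ngram list for documents where the languages are hard to tell apart, which is where I would hope to win back some of the indexing time.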
Thanks for any comments, Stefan
------------------------------------------------------------- Hommingberger Gepardenforelle http://wiki.media-style.com/display/~hommingbergergepardenforelle
