Re: language identifier

Sami Siren Mon, 18 Apr 2005 14:06:48 -0700

Stefan Groschupf wrote:

Hi Sami, Hi all,
I like the language identifier very much, but we noticed that it makes the indexing process about 3 times slower. For people who index very large segments this may be a problem. I have a set of questions:

+ Can you tell me which corpus you used to generate the ngram files?


http://people.csail.mit.edu/people/koehn/publications/europarl/

plus some hand-collected text for languages not available there.

+ Are there any plans to improve speed by fine tuning the implementation?


I will run some experiments to see how it could best be optimized.

+ Why use vectors instead of array lists?


No particular reason; either would work.
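
As a small illustration of why the choice matters little here: both java.util.Vector and java.util.ArrayList implement java.util.List, so code written against the interface runs unchanged with either. The practical difference is that Vector's methods are synchronized, which adds overhead when only one thread touches the list. The class and method names below are hypothetical and are not taken from the language identifier code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

/** Hypothetical helper, not part of the language identifier plugin. */
final class NgramListDemo {

    /** Appends every ngram of length n found in text to the supplied list. */
    static List<String> collect(String text, int n, List<String> out) {
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // The same method accepts either implementation.
        List<String> viaArrayList = collect("nutch", 2, new ArrayList<String>());
        List<String> viaVector = collect("nutch", 2, new Vector<String>());
        System.out.println(viaArrayList.equals(viaVector));
    }
}
```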

+ Do you think it makes sense to use thresholds? For example, instead of generating a score for the complete profile, use only the top 10 ngrams and check against a threshold whether there is a clear best profile. If the result isn't clear, use 10 more ngrams, and so on.


I think we should try something more basic first.
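
For reference, the thresholding idea from the question above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class IncrementalLanguageGuesser, its Guess result type, and the profile layout (language -> ngram -> weight) are hypothetical and are not the language identifier's actual API, and summing weights merely stands in for whatever scoring the real implementation uses.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of batch-wise scoring with an early exit. */
final class IncrementalLanguageGuesser {

    /** Result: the chosen language and how many document ngrams were examined. */
    static final class Guess {
        final String language;
        final int ngramsUsed;
        Guess(String language, int ngramsUsed) {
            this.language = language;
            this.ngramsUsed = ngramsUsed;
        }
    }

    private final Map<String, Map<String, Double>> profiles; // assumed layout
    private final int batchSize;  // ngrams scored per round, e.g. 10
    private final double margin;  // required lead of the best score over the runner-up

    IncrementalLanguageGuesser(Map<String, Map<String, Double>> profiles,
                               int batchSize, double margin) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.profiles = profiles;
        this.batchSize = batchSize;
        this.margin = margin;
    }

    /** rankedNgrams: the document's ngrams, most frequent first. */
    Guess identify(List<String> rankedNgrams) {
        Map<String, Double> scores = new LinkedHashMap<String, Double>();
        for (String lang : profiles.keySet()) scores.put(lang, 0.0);

        int used = 0;
        for (int start = 0; start < rankedNgrams.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rankedNgrams.size());
            for (String ngram : rankedNgrams.subList(start, end)) {
                for (Map.Entry<String, Map<String, Double>> p : profiles.entrySet()) {
                    Double weight = p.getValue().get(ngram);
                    if (weight != null) scores.merge(p.getKey(), weight, Double::sum);
                }
            }
            used = end;
            // Stop as soon as one profile clearly leads; otherwise score another batch.
            if (lead(scores) >= margin) break;
        }
        return new Guess(best(scores), used);
    }

    /** Difference between the highest and second-highest score. */
    private static double lead(Map<String, Double> scores) {
        double best = Double.NEGATIVE_INFINITY;
        double second = Double.NEGATIVE_INFINITY;
        for (double s : scores.values()) {
            if (s > best) { second = best; best = s; }
            else if (s > second) { second = s; }
        }
        return best - second;
    }

    /** Language with the highest score, or null if there are no profiles. */
    private static String best(Map<String, Double> scores) {
        String bestLang = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) { bestScore = e.getValue(); bestLang = e.getKey(); }
        }
        return bestLang;
    }
}
```

With a batch size of 10 this matches the "top 10 ngrams, then 10 more" scheme. The margin would have to be tuned by experiment, and if no profile ever gets a clear lead the loop simply falls back to scoring every ngram.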


--
 Sami Siren
