> Here are my first benchmarks on the language identifier.
> Results are in ms, obtained by identifying the language of 2 documents
> 1000 times each:
>
> Original code: 31971, 37313, 31306, 31475, 30818
>
> Replacing Vectors with ArrayLists: 31092, 30959, 30897, 31770, 30907
> Not really significant, but that's expected since this test was run in a
> single-threaded environment (it could be significant in a highly
> multi-threaded one).
>
> Prototype of the full n-gram index (as described above): 20494, 20341,
> 20516, 20533, 20389
> This approach (not yet optimized) seems to give a gain of around 1/3.
>
> I'll continue along this path and send back a patch once the code is
> clean.
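(For illustration only, a minimal sketch of the kind of Vector-to-ArrayList swap discussed in the quote above; the class and field names are hypothetical, not the actual identifier code:)

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical holder for a list of n-grams, illustrating the
 * Vector -> ArrayList swap (not the actual LanguageIdentifier code).
 */
public class NGramList {

    // Before: every call on Vector is synchronized, even when a single
    // thread owns the profile.
    // private final java.util.Vector<String> ngrams = new java.util.Vector<String>();

    // After: ArrayList skips the per-call locking; callers that share a
    // profile across threads must synchronize externally.
    private final List<String> ngrams = new ArrayList<String>();

    public void add(String ngram) {
        ngrams.add(ngram);
    }

    public String get(int index) {
        return ngrams.get(index);
    }

    public int size() {
        return ngrams.size();
    }
}
```

The only behavioural difference is that ArrayList drops Vector's per-call synchronization, which costs little in a single-threaded run but can matter under heavy concurrent access.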
Latest news about my language identifier performance improvement tests: by fixing the n-gram size to 3, in other words by using 3-grams instead of variable n-gram sizes (set by default between 1 and 4), the time to process my benchmark test set drops to the following results: 5491, 5509, 5558, 5506, 5450. Processing time is divided by 6 compared to the original code.

But these results still use the old n-gram profiles (which gather n-grams of different sizes). I must rebuild the profiles with only 3-grams in order to benchmark the code correctly.

Sami, do you use the whole set available at http://people.csail.mit.edu/people/koehn/publications/europarl/ , or just some parts of the texts to build the profiles? (If I remember my previous work on n-grams correctly, just a few MB are enough to get a representative set of 3-grams.)

Doug, is it possible to store the files used to build the n-gram profiles in the svn repository, or would that take too much space?

Jerome

PS: I am not sure my English is always comprehensible to native English speakers. Sorry! ;-)

--
http://motrech.free.fr/
http://frutch.free.fr/
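(As a rough illustration of the fixed-size approach described above, here is a minimal sketch of counting 3-grams from a piece of text; the class and method names are hypothetical and not the actual profile-building code:)

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical counter for fixed-size 3-grams (not the real profile builder). */
public class TrigramCounter {

    private static final int N = 3; // fixed n-gram size instead of the 1..4 range

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    /** Counts every 3-character sequence in the given text. */
    public void add(String text) {
        // Pad with a separator so word boundaries also produce n-grams,
        // as is common in n-gram based language identification.
        String padded = "_" + text.toLowerCase() + "_";
        for (int i = 0; i + N <= padded.length(); i++) {
            String gram = padded.substring(i, i + N);
            Integer c = counts.get(gram);
            counts.put(gram, (c == null) ? 1 : c + 1);
        }
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }
}
```

With N fixed to 3, both the profiles and the analyzed documents live in the same trigram space, so the per-size loop over 1..4 gram lengths can be dropped entirely.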
