> Here are my first benchmarks on the language identifier.
> Results are in ms, obtained by identifying the language of 2 documents
> 1000 times each:
>
> Original code: 31971, 37313, 31306, 31475, 30818
>
> Replacing Vectors with ArrayLists: 31092, 30959, 30897, 31770, 30907
> Not really significant, but that's expected since this test was run in a
> single-threaded environment (it could be significant in a highly
> multi-threaded one).
>
> Prototype of the full n-gram index (as described above): 20494, 20341,
> 20516, 20533, 20389
> This approach (not yet optimized) seems to give a gain of around 1/3.
>
> I'll continue along this path and send back a patch once the code is
> clean.
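(For illustration only, a minimal sketch of the kind of Vector-to-ArrayList swap discussed in the quote above; the class and field names are hypothetical, not the actual identifier code:)

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical holder for a list of n-grams, illustrating the
 * Vector -> ArrayList swap (not the actual LanguageIdentifier code).
 */
public class NGramList {

    // Before: every call on Vector is synchronized, even when a single
    // thread owns the profile.
    // private final java.util.Vector<String> ngrams = new java.util.Vector<String>();

    // After: ArrayList skips the per-call locking; callers that share a
    // profile across threads must synchronize externally.
    private final List<String> ngrams = new ArrayList<String>();

    public void add(String ngram) {
        ngrams.add(ngram);
    }

    public String get(int index) {
        return ngrams.get(index);
    }

    public int size() {
        return ngrams.size();
    }
}
```

The only behavioural difference is that ArrayList drops Vector's per-call synchronization, which costs little in a single-threaded run but can matter under heavy concurrent access.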
Latest news about my language identifier performance improvement tests: by fixing the n-gram size to 3, in other words by using 3-grams instead of variable n-gram sizes (set by default between 1 and 4), the time to process my benchmark test set drops to the following results: 5491, 5509, 5558, 5506, 5450. Processing time is divided by 6 compared to the original code.

But these results still use the old n-gram profiles (which gather n-grams of different sizes). I must rebuild the profiles with only 3-grams in order to benchmark the code correctly.

Sami, do you use the whole set available at http://people.csail.mit.edu/people/koehn/publications/europarl/ , or just some parts of the texts to build the profiles? (If I remember my previous work on n-grams correctly, just a few MB are enough to get a representative set of 3-grams.)

Doug, is it possible to store the files used to build the n-gram profiles in the svn repository, or would that take too much space?

Jerome

PS: I am not sure my English is always comprehensible to native English speakers. Sorry! ;-)

--
http://motrech.free.fr/
http://frutch.free.fr/
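(As a rough illustration of the fixed-size approach described above, here is a minimal sketch of counting 3-grams from a piece of text; the class and method names are hypothetical and not the actual profile-building code:)

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical counter for fixed-size 3-grams (not the real profile builder). */
public class TrigramCounter {

    private static final int N = 3; // fixed n-gram size instead of the 1..4 range

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    /** Counts every 3-character sequence in the given text. */
    public void add(String text) {
        // Pad with a separator so word boundaries also produce n-grams,
        // as is common in n-gram based language identification.
        String padded = "_" + text.toLowerCase() + "_";
        for (int i = 0; i + N <= padded.length(); i++) {
            String gram = padded.substring(i, i + N);
            Integer c = counts.get(gram);
            counts.put(gram, (c == null) ? 1 : c + 1);
        }
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }
}
```

With N fixed to 3, both the profiles and the analyzed documents live in the same trigram space, so the per-size loop over 1..4 gram lengths can be dropped entirely.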
