On 4/17/05, Jérôme Charron <[EMAIL PROTECTED]> wrote:
>
> Another possible improvement would be to have an index of all the ngrams
> (in all languages).
> Each entry in the hashtable stores a list of <Lang, Freq> pairs, one for
> each language whose profile file contains this ngram.
> This data structure avoids looping over each profile and then over each
> ngram of the document to identify; it only needs to loop over the ngrams
> of the document to identify.
> I think it could greatly improve performance...
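The index described in the quoted message can be sketched roughly as follows. This is only a minimal illustration, not the actual language identifier code: the class name NGramIndexSketch, the LangFreq pair, the fixed ngram length, and the simple sum-of-frequencies scoring are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a single ngram index shared by all language profiles. */
class NGramIndexSketch {

    /** One <Lang, Freq> pair stored under an ngram (name is an assumption). */
    static final class LangFreq {
        final String lang;
        final double freq; // relative frequency of the ngram in this language's profile

        LangFreq(String lang, double freq) {
            this.lang = lang;
            this.freq = freq;
        }
    }

    private final int n; // ngram length, fixed here for simplicity
    private final Map<String, List<LangFreq>> index = new HashMap<>();

    NGramIndexSketch(int n) {
        this.n = n;
    }

    /** Counts the character ngrams of length n in the lowercased text. */
    private Map<String, Integer> countNGrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Builds a profile from sample text and adds its ngrams to the shared index. */
    void addProfile(String lang, String sampleText) {
        Map<String, Integer> counts = countNGrams(sampleText);
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double freq = e.getValue() / (double) total;
            index.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                 .add(new LangFreq(lang, freq));
        }
    }

    /**
     * Loops only over the ngrams of the document; each lookup returns the
     * languages that contain the ngram. Returns null if nothing matched.
     * The scoring (sum of profile frequency times document count) is an
     * assumption chosen for brevity.
     */
    String identify(String document) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : countNGrams(document).entrySet()) {
            List<LangFreq> postings = index.get(e.getKey());
            if (postings == null) {
                continue; // ngram unknown to every profile
            }
            for (LangFreq p : postings) {
                scores.merge(p.lang, p.freq * e.getValue(), Double::sum);
            }
        }
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With this layout the cost of identifying a document depends on the number of distinct ngrams in the document and the length of each posting list, rather than on a full pass over every profile.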
Here are my first benchmarks of the language identifier. Results are in ms
and were obtained by identifying the language of 2 documents 1000 times each:

Original code:
  31971, 37313, 31306, 31475, 30818

Replacing Vectors by ArrayLists:
  31092, 30959, 30897, 31770, 30907

Not really significant, but that's to be expected: this test was performed in
a single-threaded environment (it could be significant in a highly
multi-threaded environment).

Prototype of full NGrams index (as described above):
  20494, 20341, 20516, 20533, 20389

This approach (not yet optimized) seems to give a gain of around 1/3...
I'll continue in this direction and send a patch once the code is in
cleaner shape.

Jerome

--
http://motrech.free.fr/
http://frutch.free.fr/
