[Nutch-dev] Re: language identifier

Jérôme Charron Wed, 20 Apr 2005 10:02:38 -0700

On 4/17/05, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> 
> Another possible improvement would be to build an index of all the ngrams
> (in all languages).
> Each entry in the hashtable stores a list of <Lang, Freq> pairs, one for each
> language that contains this ngram in its file.
> This data structure avoids looping over each profile and then over each
> ngram of the document to identify;
> it only needs to loop over the ngrams of the document to identify.
> I think it could greatly improve performance...




Here are my first benchmarks of the language identifier.
Results are in ms, obtained by identifying the language of 2 documents
1000 times each:

Original code: 31971, 37313, 31306, 31475, 30818

Replacing Vectors with ArrayLists: 31092, 30959, 30897, 31770, 30907
Not really significant, but that is expected: this test was performed in a
single-threaded environment, where Vector's synchronized methods are never
contended.
(It could be significant in a highly multi-threaded environment.)

Prototype of full ngram index (as described above): 20494, 20341, 20516,
20533, 20389
This approach (not yet optimized) seems to cut the running time by around 1/3.
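
To make the idea concrete, here is a minimal sketch of such an index. It is
not the prototype measured above: the class name NGramIndex, the LangFreq
pair, the add/identify methods and the scoring rule (summing the frequencies
of the matching ngrams) are all assumptions made for illustration, and it
uses current Java syntax.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one index over the ngrams of every language profile.
class NGramIndex {

    // One <Lang, Freq> pair (name and fields are illustrative).
    static final class LangFreq {
        final String lang;
        final double freq;

        LangFreq(String lang, double freq) {
            this.lang = lang;
            this.freq = freq;
        }
    }

    // ngram -> list of <Lang, Freq> pairs for the languages containing it.
    private final Map<String, List<LangFreq>> index = new HashMap<>();

    // Register one ngram of a language profile.
    void add(String lang, String ngram, double freq) {
        index.computeIfAbsent(ngram, k -> new ArrayList<>())
             .add(new LangFreq(lang, freq));
    }

    // Loop only over the ngrams of the document, never over the profiles.
    // Scoring by summed frequency is an assumption made for this sketch.
    String identify(String text, int n) {
        Map<String, Double> scores = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            List<LangFreq> pairs = index.get(text.substring(i, i + n));
            if (pairs == null) {
                continue; // ngram unknown in every language
            }
            for (LangFreq p : pairs) {
                scores.merge(p.lang, p.freq, Double::sum);
            }
        }
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best; // null when no ngram of the text is indexed
    }
}
```

With this layout, the cost of identifying a document depends on the number
of ngrams in the document and on how many languages share each ngram, not on
the total number of profiles loaded.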

I will continue in this direction and send a patch once the code is clean
enough.

Jerome

-- 
http://motrech.free.fr/
http://frutch.free.fr/
