[Nutch-dev] Re: language identifier

Jérôme Charron Sat, 16 Apr 2005 15:39:22 -0700

Another possible improvement would be to build an index of all the ngrams (in 
all languages).
Each entry in the hashtable stores a list of <Lang, Freq> pairs, one for each 
language whose profile file contains this ngram.
This data structure avoids looping over each profile and then over each ngram of 
the document to identify;
it only needs to loop over the ngrams of the document to identify.
I think it could greatly improve performance...
Comments?
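
To make the idea concrete, here is a minimal sketch in Java. It is not the existing language identifier code: the class name NGramIndex, the LangFreq pair type, and the methods addProfile, score and identify are all hypothetical names chosen for illustration. The scoring rule (summing the profile frequency of every matching ngram) is also just an assumption, picked because it is the simplest thing that works with this structure; any other per-ngram measure could be accumulated the same way.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical inverted index: ngram -> list of <Lang, Freq> pairs. */
public class NGramIndex {

    /** One <Lang, Freq> pair: the frequency of an ngram in one language profile. */
    static final class LangFreq {
        final String lang;
        final double freq;

        LangFreq(String lang, double freq) {
            this.lang = lang;
            this.freq = freq;
        }
    }

    private final Map<String, List<LangFreq>> index = new HashMap<>();

    /** Adds every ngram of one language profile to the shared index. */
    public void addProfile(String lang, Map<String, Double> profile) {
        for (Map.Entry<String, Double> e : profile.entrySet()) {
            index.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                 .add(new LangFreq(lang, e.getValue()));
        }
    }

    /**
     * Loops only over the ngrams of the document. Each lookup returns the
     * languages that contain the ngram, so no per-profile loop is needed.
     * Assumed scoring: sum of the matching profile frequencies.
     */
    public Map<String, Double> score(List<String> docNgrams) {
        Map<String, Double> scores = new HashMap<>();
        for (String ngram : docNgrams) {
            List<LangFreq> pairs = index.get(ngram);
            if (pairs == null) {
                continue; // ngram unknown in every language
            }
            for (LangFreq p : pairs) {
                scores.merge(p.lang, p.freq, Double::sum);
            }
        }
        return scores;
    }

    /** Returns the best-scoring language, or null if no ngram matched. */
    public String identify(List<String> docNgrams) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score(docNgrams).entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With this layout the cost of identifying a document depends on the number of ngrams in the document and on how many languages share each of them, rather than on the number of profiles times the number of ngrams.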


Jerome

-- 
http://motrech.free.fr/
http://frutch.free.fr/
