Hi Sami, Hi all,

I like the language identifier very much, but we noticed that it slows down the indexing process by a factor of three.
If people index very large segments, this may be a problem.


I have a set of questions:
+ Can you tell me which corpus you used to generate the ngram files?
+ Are there any plans to improve speed by fine-tuning the implementation?
+ Why use Vector instead of ArrayList?
+ Do you think it makes sense to use thresholds? For example, instead of scoring against the complete profile, use only the top 10 ngrams and check whether there is a clear best profile based on a threshold. If the result isn't clear, use 10 more ngrams, and so on (see the sketch below).
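
To make the last point more concrete, here is a rough sketch of the kind of incremental check I mean. All of the names below (LanguageProfile, score, BATCH_SIZE, THRESHOLD) are placeholders I made up for illustration, not the actual plugin code:

    import java.util.List;

    // Hypothetical sketch of incremental profile matching: score the
    // document's top ngrams against each language profile in small batches
    // and stop as soon as one profile is clearly ahead of the runner-up.
    public class IncrementalMatcher {

        // Placeholder for a language profile that can score a batch of ngrams.
        public interface LanguageProfile {
            String language();
            double score(List<String> ngrams);
        }

        private static final int BATCH_SIZE = 10;    // ngrams considered per round
        private static final double THRESHOLD = 0.2; // required lead over the runner-up

        // 'docNgrams' are the document's ngrams, most frequent first.
        public static LanguageProfile identify(List<String> docNgrams,
                                               List<LanguageProfile> profiles) {
            double[] scores = new double[profiles.size()];
            int used = 0;
            int best = 0;
            while (used < docNgrams.size()) {
                int end = Math.min(used + BATCH_SIZE, docNgrams.size());
                List<String> batch = docNgrams.subList(used, end);
                for (int i = 0; i < profiles.size(); i++) {
                    scores[i] += profiles.get(i).score(batch); // placeholder scoring call
                }
                used = end;

                // Find the leader and the runner-up among the scores so far.
                best = 0;
                int second = -1;
                for (int i = 1; i < profiles.size(); i++) {
                    if (scores[i] > scores[best]) { second = best; best = i; }
                    else if (second < 0 || scores[i] > scores[second]) { second = i; }
                }
                // Stop early once the leader is ahead by more than the threshold.
                if (second >= 0 && scores[best] - scores[second] > THRESHOLD) {
                    break;
                }
            }
            return profiles.get(best);
        }
    }

With batches of 10, the common case should need only one or two rounds, so the cost of comparing against the complete profiles would only be paid for genuinely ambiguous documents.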



Thanks for any comments, Stefan


-------------------------------------------------------------
Hommingberger Gepardenforelle
http://wiki.media-style.com/display/~hommingbergergepardenforelle


