After some long nights benchmarking and profiling the language identifier plugin, I have just attached a new patch for it on Jira (http://issues.apache.org/jira/browse/NUTCH-60). The patch adds configuration options that let you specify how much data to use for language analysis and which n-gram sizes to use. It also includes some optimizations that reduce processing time by 20% to 70%, depending on the configuration (the size of the data to process), with an average gain of 25%. I will post more detailed benchmark results on the Wiki as soon as possible (http://wiki.apache.org/nutch/LanguageIdentifierBenchs), along with some possible improvements at http://wiki.apache.org/nutch/NewLanguageIdentifier.
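To give an idea of what these two kinds of options control, here is a minimal sketch (not the code from the patch). It assumes the analysis works on character n-grams taken from the beginning of the text. The class name `NGramSketch`, the method `countNGrams` and the parameters `minN`, `maxN` and `maxAnalyzeLength` are hypothetical names chosen for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class NGramSketch {

    /**
     * Counts character n-grams of every size from minN to maxN, looking only
     * at the first maxAnalyzeLength characters of the text.
     * All names here are illustrative, not the plugin's actual API.
     */
    public static Map<String, Integer> countNGrams(
            String text, int minN, int maxN, int maxAnalyzeLength) {
        if (minN < 1 || maxN < minN || maxAnalyzeLength < 0) {
            throw new IllegalArgumentException("invalid n-gram or length settings");
        }
        // Limit the amount of data analyzed: less data means less work.
        String sample = text.length() > maxAnalyzeLength
                ? text.substring(0, maxAnalyzeLength)
                : text;

        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of width n over the sample.
            for (int i = 0; i + n <= sample.length(); i++) {
                counts.merge(sample.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Only "banana" (the first 6 characters) is analyzed here.
        Map<String, Integer> counts = countNGrams("banana bread", 1, 2, 6);
        System.out.println(counts);
    }
}
```

In this sketch, both a smaller `maxAnalyzeLength` and a narrower range of n-gram sizes reduce the number of substrings that have to be built and counted, which is why the processing time depends on the configuration.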
Jerome -- http://motrech.free.fr/ http://frutch.free.fr/
