Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/LanguageIdentifierBenchs

------------------------------------------------------------------------------
  == Introduction ==

- This page provides some performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. These benchmarks were produced by analyzing results from the previous version (nutch-0.7-dev) and the patches [http://issues.apache.org/jira/secure/attachment/20236/NUTCH-60-050526.patch NUTCH-60-050526.patch] and NUTCH-60-050607.patch (see NewLanguageIdentifier for more details).
+ This page provides some performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. These benchmarks were produced by analyzing results from the previous version (`nutch-0.7-dev`) and the patches [http://issues.apache.org/jira/secure/attachment/20236/NUTCH-60-050526.patch NUTCH-60-050526.patch] and [NUTCH-60-050605.patch http://issues.apache.org/jira/secure/attachment/12310539/NUTCH-60-050605.patch] [NUTCH-60-050607.patch http://issues.apache.org/jira/secure/attachment/12310616/NUTCH-60-050607.patch] (see NewLanguageIdentifier for more details).

  These data can be usefull if you want to contribute in increasing the LanguageIdentifierPlugin performance and/or precision, or if you want to tune precisely your ["Nutch"] configuration.

@@ -17, +17 @@

  The following matrix shows the LanguageIdentifierPlugin processing time in ''ms'' for many versions. Each patched version is configured to be comparable with the nutch-0.7-dev version, ie by using both 1-grams, 2-grams, 3-grams and 4-grams for performing analysis.

  The ''Data Size'' row is the size of data in bytes used in each file to perform the identification. Other rows represent the following configurations:
-  * ''Nutch-0.7'': The nutch-0.7-dev LanguageIdentifierPlugin version (without patch).
+  * `Nutch-0.7`: The nutch-0.7-dev LanguageIdentifierPlugin version (without patch).
-  * ''NUTCH-60-050526'': The LanguageIdentifierPlugin code with NUTCH-60-050526.patch applied.
+  * `NUTCH-60-050526`: The LanguageIdentifierPlugin code with NUTCH-60-050526.patch applied.
-  * ''NUTCH-60-050607'': The LanguageIdentifierPlugin code with NUTCH-60-050607.patch applied.
+  * `NUTCH-60-050607`: The LanguageIdentifierPlugin code with NUTCH-60-050607.patch applied.

  || ||'''Nutch-0.7'''||||'''NUTCH-60-050526'''||||'''NUTCH-60-050607'''||
  ||'''Data Size'''||'''time'''||'''time'''||'''%'''||'''time'''||'''%'''||
@@ -49, +49 @@

  === Discussion ===

- ''TODO''
+  * The NUTCH-60-050607.patch increases performances from `18.27%` to `70.29%` with an average of `24.33%`.
+  * The profiling of the code confirms what SamiSiren suggests in a [http://www.mail-archive.com/[email protected]/msg00501.html previous message]: ''"the most timeconsuming part of language identifier is splitting the text into ngrams and propably the biggest optimization could be done there"''. Profiling confirms this point and shows that the splitting of the text takes around `25%` of the whole process.
+

  == Precision ==
  === Data set ===
- These ''precision'' benchmarks were produced by testing the LanguageIdentifierPlugin on the '''Data Size'' first bytes from a set of :
+ These ''precision'' benchmarks were produced by testing the LanguageIdentifierPlugin on the '''Data Size''' first bytes from a set of :
   * 492 french files,
   * 487 english files,
   * 488 deutch files.

_______________________________________________
Nutch-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
