Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/LanguageIdentifierBenchs

------------------------------------------------------------------------------
  == Introduction ==
  
- This page provides some performance (code speed) and precision 
(identification accuracy) benchmarks of the LanguageIdentifierPlugin. These 
benchmarks were produced by analyzing results from the previous version 
(nutch-0.7-dev) and the patches 
[http://issues.apache.org/jira/secure/attachment/20236/NUTCH-60-050526.patch 
NUTCH-60-050526.patch] and NUTCH-60-050607.patch (see NewLanguageIdentifier for 
more details).
+ This page provides some performance (code speed) and precision 
(identification accuracy) benchmarks of the LanguageIdentifierPlugin. These 
benchmarks were produced by analyzing results from the previous version 
(`nutch-0.7-dev`) and the patches 
[http://issues.apache.org/jira/secure/attachment/20236/NUTCH-60-050526.patch 
NUTCH-60-050526.patch], 
[http://issues.apache.org/jira/secure/attachment/12310539/NUTCH-60-050605.patch 
NUTCH-60-050605.patch] and 
[http://issues.apache.org/jira/secure/attachment/12310616/NUTCH-60-050607.patch 
NUTCH-60-050607.patch] 
(see NewLanguageIdentifier for more details).
  
  These data can be useful if you want to help improve the 
LanguageIdentifierPlugin performance and/or precision, or if you want to 
fine-tune your ["Nutch"] configuration.
  
@@ -17, +17 @@

  The following matrix shows the LanguageIdentifierPlugin processing time in 
''ms'' for several versions. Each patched version is configured to be comparable 
with the nutch-0.7-dev version, i.e. by using 1-grams, 2-grams, 3-grams and 
4-grams to perform the analysis.
  The ''Data Size'' column is the number of bytes read from each file to 
perform the identification.
  The other columns represent the following configurations:
-  * ''Nutch-0.7'': The nutch-0.7-dev LanguageIdentifierPlugin version (without 
patch).
+  * `Nutch-0.7`: The nutch-0.7-dev LanguageIdentifierPlugin version (without 
patch).
-  * ''NUTCH-60-050526'': The LanguageIdentifierPlugin code with 
NUTCH-60-050526.patch applied.
+  * `NUTCH-60-050526`: The LanguageIdentifierPlugin code with 
NUTCH-60-050526.patch applied.
-  * ''NUTCH-60-050607'': The LanguageIdentifierPlugin code with 
NUTCH-60-050607.patch applied.
+  * `NUTCH-60-050607`: The LanguageIdentifierPlugin code with 
NUTCH-60-050607.patch applied.
  
  || ||'''Nutch-0.7'''||||'''NUTCH-60-050526'''||||'''NUTCH-60-050607'''||
  ||'''Data Size'''||'''time'''||'''time'''||'''%'''||'''time'''||'''%'''||
@@ -49, +49 @@

  
  === Discussion ===
  
- ''TODO''
+  * The NUTCH-60-050607.patch improves performance by `18.27%` to `70.29%`, 
with an average of `24.33%`.
+  * Profiling the code confirms what SamiSiren suggested in a 
[http://www.mail-archive.com/[email protected]/msg00501.html 
previous message]: ''"the most timeconsuming part of language identifier is 
splitting the text into ngrams and propably the biggest optimization could be 
done there"''. It shows that splitting the text takes around `25%` of the 
whole process.
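
To make the splitting step discussed above concrete, here is a minimal sketch 
of cutting a text into 1-grams to 4-grams. It is not the plugin's actual code: 
the class name `NGramSplitter` and the method `ngrams` are hypothetical, and 
the sketch ignores whatever tokenization or padding the real 
LanguageIdentifierPlugin applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Nutch source tree.
public class NGramSplitter {

    // Returns every character n-gram of the text, for each length from
    // minLen to maxLen inclusive, ordered by length and then by position.
    static List<String> ngrams(String text, int minLen, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int n = minLen; n <= maxLen; n++) {
            // Slide a window of width n over the text.
            for (int i = 0; i + n <= text.length(); i++) {
                out.add(text.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same n-gram lengths as the benchmark configuration (1 to 4).
        System.out.println(ngrams("nutch", 1, 4));
    }
}
```

Every character position yields up to four substrings in this naive form, 
which gives some intuition for why the splitting step could weigh on the 
total processing time.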
   
+ 
  == Precision ==
  
  === Data set ===
  
- These ''precision'' benchmarks were produced by testing the 
LanguageIdentifierPlugin on the '''Data Size'' first bytes from a set of :
+ These ''precision'' benchmarks were produced by testing the 
LanguageIdentifierPlugin on the first '''Data Size''' bytes of each file from a set of:
   * 492 French files,
   * 487 English files,
   * 488 German files.


_______________________________________________
Nutch-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
