Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/LanguageIdentifierBenchs

The comment on the change is:
New performance results + precision results

------------------------------------------------------------------------------
  == Introduction ==
  
- This page provides some performance benchmarks (not precision) of the 
LanguageIdentifierPlugin between the ''old'' (previous) version and the ''new'' 
(configurable) version (see NewLanguageIdentifier for more details).
+ This page provides some performance (code speed) and precision 
(identification accuracy) benchmarks of the LanguageIdentifierPlugin. These 
benchmarks were produced by analyzing results from the previous version 
(nutch-0.7-dev) and the patches 
[http://issues.apache.org/jira/secure/attachment/20236/NUTCH-60-050526.patch 
NUTCH-60-050526.patch] and NUTCH-60-050607.patch (see NewLanguageIdentifier for 
more details).
  
- These data can be useful if you want to help improve the 
LanguageIdentifierPlugin performance, or if you want to fine-tune your 
["Nutch"] configuration.
+ These data can be useful if you want to help improve the 
LanguageIdentifierPlugin performance and/or precision, or if you want to 
fine-tune your ["Nutch"] configuration.
  
- == Data set ==
+ == Performance ==
  
- These benchmarks were produced by testing the LanguageIdentifierPlugin on a 
set of 492 French files representing a total size of 171.3 MB. These files were 
extracted from the ''[http://people.csail.mit.edu/koehn/publications/europarl/ 
European Parliament Proceedings Parallel Corpus 1996-2003 Release v2]''.
+ === Data set ===
  
- == Raw results ==
+ These ''performance'' benchmarks were produced by testing the 
LanguageIdentifierPlugin on a set of 492 French files representing a total size 
of 171.3 MB. These files were extracted from the 
''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament 
Proceedings Parallel Corpus 1996-2003 Release v2]''.
  
- The following matrix shows the LanguageIdentifierPlugin processing time in 
''ms'' for different configurations.
- The ''Data Size'' column is the number of bytes read from each file to 
perform the identification (note that every test case reported in this 
matrix returned the correct language).
+ === Raw results ===
+ 
+ The following matrix shows the LanguageIdentifierPlugin processing time in 
''ms'' for several versions. Each patched version is configured to be comparable 
with the nutch-0.7-dev version, i.e. by using 1-grams, 2-grams, 3-grams and 
4-grams for the analysis.
+ The ''Data Size'' column is the number of bytes read from each file to 
perform the identification.
  Other columns represent the following configurations:
-  * ''P.V.'': The Previous Version of the LanguageIdentifierPlugin.
-  * ''[x-y]'': The new LanguageIdentifierPlugin version using ngrams from size 
''x'' to ''y'' to perform identification.
+  * ''Nutch-0.7'': The nutch-0.7-dev LanguageIdentifierPlugin version (without 
patch).
+  * ''NUTCH-60-050526'': The LanguageIdentifierPlugin code with 
NUTCH-60-050526.patch applied.
+  * ''NUTCH-60-050607'': The LanguageIdentifierPlugin code with 
NUTCH-60-050607.patch applied.
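
For readers unfamiliar with the term, the sketch below shows what collecting character 1-grams to 4-grams from a piece of text could look like. It is a generic Python illustration, not the plugin's Java code: the function name `char_ngrams` is hypothetical, and the absence of any padding, lowercasing or tokenization is an assumption made to keep the example short.

```python
from collections import Counter


def char_ngrams(text, min_n=1, max_n=4):
    """Count every character n-gram of length min_n..max_n found in text."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    profile = char_ngrams("nutch")
    for gram, count in sorted(profile.items()):
        print(repr(gram), count)
```

Widening the range of n-gram sizes increases the number of n-grams that have to be counted for the same input.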
  
- ||'''Data Size'''||'''P.V.'''||'''[1-4]'''||'''[2-2]'''||'''[3-3]'''||'''[4-4]'''||'''[2-3]'''||'''[3-4]'''||'''[2-4]'''||
- ||'''128'''||8314||5124||1627||2245||1393||3073||2996||4243||
- ||'''256'''||7660||4950||1408||1604||1425||3033||2809||3983||
- ||'''512'''||8017||4917||1296||1525||1150||2990||2912||3959||
- ||'''1024'''||8265||7188||1672||1722||1200||2933||2876||4932||
- ||'''2048'''||11541||9252||2213||2909||2601||5438||5530||7307||
- ||'''4096'''||14989||12485||2938||4190||3856||7654||8543||10416||
- ||'''8192'''||21167||18289||4880||6621||5538||11259||12557||15302||
- ||'''16384'''||32295||29488||9028||11173||13130||17560||19809||23673||
- ||'''32768'''||52918||49417||16396||18446||20158||26879||30858||39311||
- ||'''65536'''||97527||91285||33242||33695||34490||50894||54398||71920||
- ||'''131072'''||167502||161258||56036||53706||53527||87603||90553||122413||
- ||'''262144'''||304609||289395||107108||108841||108674||180461||165561||222535||
- ||'''524288'''||463008||442028||151086||146601||156372||253797||245313||336378||
+ || ||'''Nutch-0.7'''||||'''NUTCH-60-050526'''||||'''NUTCH-60-050607'''||
+ ||'''Data Size'''||'''time'''||'''time'''||'''%'''||'''time'''||'''%'''||
+ ||128||2410||1485||38.38||716||70.29||
+ ||256||2842||1836||35.40||1048||63.12||
+ ||512||3759||2305||38.68||1649||56.13||
+ ||1024||5899||5130||13.04||2839||51.87||
+ ||2048||8581||7462||13.04||4534||47.16||
+ ||4096||12622||10513||16.71||8031||36.37||
+ ||8192||21360||18289||14.38||13803||35.38||
+ ||16384||32073||29488||8.06||23733||26.00||
+ ||32768||58535||49417||15.58||41994||28.26||
+ ||65536||99861||91285||8.59||81612||18.27||
+ ||131072||184083||161258||12.40||140501||23.68||
+ ||262144||309438||289395||6.48||244369||21.03||
+ ||524288||504145||442028||12.32||377693||25.08||
+ ||Total||1245608||1109891||10.90||942522||24.33||
+ ||Average||95816||85376.23||10.90||72501.69||24.33||
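
The page does not define the ''%'' columns. They appear to give the reduction in processing time relative to Nutch-0.7, a reading that matches the rows checked below. The following Python sketch recomputes three rows under that assumption; the helper name `reduction_percent` is hypothetical.

```python
# Timings in ms copied from three rows of the table above:
# data size -> (Nutch-0.7, NUTCH-60-050526, NUTCH-60-050607)
timings = {
    128: (2410, 1485, 716),
    1024: (5899, 5130, 2839),
    524288: (504145, 442028, 377693),
}


def reduction_percent(baseline_ms, patched_ms):
    """Share of the baseline time saved by the patched version, in percent."""
    return round(100.0 * (baseline_ms - patched_ms) / baseline_ms, 2)


if __name__ == "__main__":
    for size, (base, p0526, p0607) in timings.items():
        print(size, reduction_percent(base, p0526), reduction_percent(base, p0607))
```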
  
- == Graphical Representation ==
+ === Graphical representation ===
  
- [http://frutch.free.fr/images/nutch/langid-benchs01.png]
+ [http://frutch.free.fr/images/nutch/langid-benchs03.jpg]
  
- == Graphical Representation (log axis) ==
+ === Graphical representation (log axis) ===
  
- [http://frutch.free.fr/images/nutch/langid-benchs02.png]
+ [http://frutch.free.fr/images/nutch/langid-benchs04.jpg]
  
- == Discussion ==
+ === Discussion ===
+ 
+ ''TODO''
+  
+ == Precision ==
+ 
+ === Data set ===
+ 
+ These ''precision'' benchmarks were produced by testing the 
LanguageIdentifierPlugin on the first ''Data Size'' bytes of each file in a set of:
+  * 492 French files,
+  * 487 English files,
+  * 488 German files.
+ (These files were extracted from the 
''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament 
Proceedings Parallel Corpus 1996-2003 Release v2]''.)
+ 
+ === Raw results ===
+ 
+ || ||||||||'''Nutch-0.7'''||||||||'''NUTCH-60-050605'''||||||||'''NUTCH-60-050607'''||
+ ||'''Data Size'''||'''avg'''||'''fr'''||'''en'''||'''de'''||'''avg'''||'''fr'''||'''en'''||'''de'''||'''avg'''||'''fr'''||'''en'''||'''de'''||
+ ||8||38.84||36.99||10.47||69.06||14.00||2.64||2.67||36.68||51.11||48.37||19.30||85.66||
+ ||16||70.38||58.74||75.15||77.25||45.64||13.41||68.17||55.33||94.06||97.36||87.68||97.13||
+ ||32||66.51||55.08||86.86||57.58||56.43||41.26||73.92||54.10||98.56||99.59||96.30||99.80||
+ ||64||97.14||97.15||97.54||96.72||65.35||53.86||84.80||57.38||99.93||100||99.79||100||
+ ||128||97.90||94.51||99.79||99.39||77.81||70.53||89.32||73.57||100||100||100||100||
+ ||256||100||100||100||100||90.32||90.04||92.20||88.73||100||100||100||100||
+ ||512||100||100||100||100||96.93||98.17||97.54||95.08||100||100||100||100||
+ ||1024||100||100||100||100||99.59||99.80||99.79||99.18||100||100||100||100||
+ ||2048||100||100||100||100||100||100||100||100||100||100||100||100||
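
The ''avg'' columns are not defined on the page. For the 8-byte row they match the unweighted mean of the ''fr'', ''en'' and ''de'' values, not a mean weighted by the number of files per language. The Python sketch below checks that reading on two of the three versions; the helper name `average_precision` is hypothetical.

```python
from statistics import mean

# Per-language precision (%) for the 8-byte row, copied from the table above.
row_8_bytes = {
    "Nutch-0.7": {"fr": 36.99, "en": 10.47, "de": 69.06},
    "NUTCH-60-050607": {"fr": 48.37, "en": 19.30, "de": 85.66},
}


def average_precision(per_language):
    """Unweighted mean of the per-language precision values, in percent."""
    return round(mean(list(per_language.values())), 2)


if __name__ == "__main__":
    for version, scores in row_8_bytes.items():
        print(version, average_precision(scores))
```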
+ 
+ === Graphical representation ===
+ 
+ [http://frutch.free.fr/images/nutch/langid-benchs05.jpg]
+ 
+ === Discussion ===
  
  ''TODO''
  


_______________________________________________
Nutch-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
