Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by JeromeCharron:
http://wiki.apache.org/nutch/LanguageIdentifierBenchs

The comment on the change is:
New performance results + precision results

------------------------------------------------------------------------------
== Introduction ==
- This page provides some performance benchmarks (not precision) of the LanguageIdentifierPlugin between the ''old'' (previous) version and the ''new'' (configurable) version (see NewLanguageIdentifier for more details).
+ This page provides some performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. These benchmarks were produced by analyzing results from the previous version (nutch-0.7-dev) and the patches [http://issues.apache.org/jira/secure/attachment/20236/NUTCH-60-050526.patch NUTCH-60-050526.patch] and NUTCH-60-050607.patch (see NewLanguageIdentifier for more details).
- These data can be useful if you want to contribute in increasing the LanguageIdentifierPlugin performances, or if you want to tune precisely your ["Nutch"] configuration.
+ These data can be useful if you want to contribute in increasing the LanguageIdentifierPlugin performance and/or precision, or if you want to tune precisely your ["Nutch"] configuration.
- == Data set ==
+ == Performance ==
- These benchmarks were produced by testing the LanguageIdentifierPlugin on a set of 492 French files representing a total size of 171.3 MB. These files were extracted from the ''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament Proceedings Parallel Corpus 1996-2003 Release v2]''.
+ === Data set ===
- == Raw results ==
+ These ''performance'' benchmarks were produced by testing the LanguageIdentifierPlugin on a set of 492 French files representing a total size of 171.3 MB. These files were extracted from the ''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament Proceedings Parallel Corpus 1996-2003 Release v2]''.
- The following matrix shows the LanguageIdentifierPlugin processing time in ''ms'' for different configurations.
- The ''Data Size'' row is the size of data in bytes used in each file to perform the identification (please notice that each test case reported in this matrix returns a good language identification).
+ === Raw results ===
+
+ The following matrix shows the LanguageIdentifierPlugin processing time in ''ms'' for many versions. Each patched version is configured to be comparable with the nutch-0.7-dev version, i.e. by using 1-grams, 2-grams, 3-grams and 4-grams for performing analysis.
+ The ''Data Size'' row is the size of data in bytes used in each file to perform the identification.
Other rows represent the following configurations:
- * ''P.V.'': The Previous Version of the LanguageIdentifierPlugin.
- * ''[x-y]'': The new LanguageIdentifierPlugin version using ngrams from size ''x'' to ''y'' to perform identification.
+ * ''Nutch-0.7'': The nutch-0.7-dev LanguageIdentifierPlugin version (without patch).
+ * ''NUTCH-60-050526'': The LanguageIdentifierPlugin code with NUTCH-60-050526.patch applied.
+ * ''NUTCH-60-050607'': The LanguageIdentifierPlugin code with NUTCH-60-050607.patch applied.
- ||'''Data Size'''||'''P.V.'''||'''[1-4]'''||'''[2-2]'''||'''[3-3]'''||'''[4-4]'''||'''[2-3]'''||'''[3-4]'''||'''[2-4]'''||
- ||'''128'''||8314||5124||1627||2245||1393||3073||2996||4243||
- ||'''256'''||7660||4950||1408||1604||1425||3033||2809||3983||
- ||'''512'''||8017||4917||1296||1525||1150||2990||2912||3959||
- ||'''1024'''||8265||7188||1672||1722||1200||2933||2876||4932||
- ||'''2048'''||11541||9252||2213||2909||2601||5438||5530||7307||
- ||'''4096'''||14989||12485||2938||4190||3856||7654||8543||10416||
- ||'''8192'''||21167||18289||4880||6621||5538||11259||12557||15302||
- ||'''16384'''||32295||29488||9028||11173||13130||17560||19809||23673||
- ||'''32768'''||52918||49417||16396||18446||20158||26879||30858||39311||
- ||'''65536'''||97527||91285||33242||33695||34490||50894||54398||71920||
- ||'''131072'''||167502||161258||56036||53706||53527||87603||90553||122413||
- ||'''262144'''||304609||289395||107108||108841||108674||180461||165561||222535||
- ||'''524288'''||463008||442028||151086||146601||156372||253797||245313||336378||
+ || ||'''Nutch-0.7'''||||'''NUTCH-60-050526'''||||'''NUTCH-60-050607'''||
+ ||'''Data Size'''||'''time'''||'''time'''||'''%'''||'''time'''||'''%'''||
+ ||128||2410||1485||38.38||716||70.29||
+ ||256||2842||1836||35.40||1048||63.12||
+ ||512||3759||2305||38.68||1649||56.13||
+ ||1024||5899||5130||13.04||2839||51.87||
+ ||2048||8581||7462||13.04||4534||47.16||
+ ||4096||12622||10513||16.71||8031||36.37||
+ ||8192||21360||18289||14.38||13803||35.38||
+ ||16384||32073||29488||8.06||23733||26.00||
+ ||32768||58535||49417||15.58||41994||28.26||
+ ||65536||99861||91285||8.59||81612||18.27||
+ ||131072||184083||161258||12.40||140501||23.68||
+ ||262144||309438||289395||6.48||244369||21.03||
+ ||524288||504145||442028||12.32||377693||25.08||
+ ||Total||1245608||1109891||10.90||942522||24.33||
+ ||Average||95816||85376.23||10.90||72501.69||24.33||
- == Graphical Representation ==
+ === Graphical representation ===
- [http://frutch.free.fr/images/nutch/langid-benchs01.png]
+ [http://frutch.free.fr/images/nutch/langid-benchs03.jpg]
- == Graphical Representation (log axis) ==
+ === Graphical representation (log axis) ===
- [http://frutch.free.fr/images/nutch/langid-benchs02.png]
+ [http://frutch.free.fr/images/nutch/langid-benchs04.jpg]
- == Discussion ==
+ === Discussion ===
+
+ ''TODO''
+
+ == Precision ==
+
+ === Data set ===
+
+ These ''precision'' benchmarks were produced by testing the LanguageIdentifierPlugin on the first ''Data Size'' bytes from a set of:
+ * 492 French files,
+ * 487 English files,
+ * 488 German files.
+ (These files were extracted from the ''[http://people.csail.mit.edu/koehn/publications/europarl/ European Parliament Proceedings Parallel Corpus 1996-2003 Release v2]'').
+
+ === Raw results ===
+
+ || ||||||||'''Nutch-0.7'''||||||||'''NUTCH-60-050605'''||||||||'''NUTCH-60-050607'''||
+ ||'''Data Size'''||'''avg'''||'''fr'''||'''en'''||'''de'''||'''avg'''||'''fr'''||'''en'''||'''de'''||'''avg'''||'''fr'''||'''en'''||'''de'''||
+ ||8||38.84||36.99||10.47||69.06||14.00||2.64||2.67||36.68||51.11||48.37||19.30||85.66||
+ ||16||70.38||58.74||75.15||77.25||45.64||13.41||68.17||55.33||94.06||97.36||87.68||97.13||
+ ||32||66.51||55.08||86.86||57.58||56.43||41.26||73.92||54.10||98.56||99.59||96.30||99.80||
+ ||64||97.14||97.15||97.54||96.72||65.35||53.86||84.80||57.38||99.93||100||99.79||100||
+ ||128||97.90||94.51||99.79||99.39||77.81||70.53||89.32||73.57||100||100||100||100||
+ ||256||100||100||100||100||90.32||90.04||92.20||88.73||100||100||100||100||
+ ||512||100||100||100||100||96.93||98.17||97.54||95.08||100||100||100||100||
+ ||1024||100||100||100||100||99.59||99.80||99.79||99.18||100||100||100||100||
+ ||2048||100||100||100||100||100||100||100||100||100||100||100||100||
+
+ === Graphical representation ===
+
+ [http://frutch.free.fr/images/nutch/langid-benchs05.jpg]
+
+ === Discussion ===
''TODO''
-------------------------------------------------------
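The page does not define the ''%'' columns of the performance table. The published values appear consistent with the relative reduction in processing time against the Nutch-0.7 column, i.e. (baseline - patched) / baseline * 100. The following is a minimal Python sketch under that assumption; the function name `reduction_percent` is hypothetical and is not part of Nutch.

```python
def reduction_percent(baseline_ms, patched_ms):
    """Relative time saved versus the baseline, in percent (assumed formula)."""
    return round((baseline_ms - patched_ms) / baseline_ms * 100, 2)


# Times in ms copied from the performance table: (Nutch-0.7, NUTCH-60-050607).
rows = {
    128: (2410, 716),
    1024: (5899, 2839),
    524288: (504145, 377693),
}

for data_size, (baseline, patched) in rows.items():
    print(data_size, reduction_percent(baseline, patched))
```

For these three rows the sketch gives 70.29, 51.87 and 25.08, which match the NUTCH-60-050607 ''%'' column.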
