[ http://issues.apache.org/jira/browse/NUTCH-60?page=comments#action_12313323 ]

Jerome Charron commented on NUTCH-60:
-------------------------------------

Sami, 

* To measure speed, I simply uncomment the lines marked "used for benchs" 
in the main method of LanguageIdentifier. Then I run TestIdentifier on a 
large set of files using the fileset command-line argument.

* To measure quality, I configure the language identifier plugin with the 
desired amount of data to analyze, re-comment the lines that were 
uncommented for the speed test, and run the command line with the fileset 
argument on a large set of documents that are all in the same language, 
piping the output through grep and wc to count the failed identifications:
  java org.apache.nutch.analysis.lang.LanguageIdentifier -identifyfileset \
    /somewhere/fr/*.txt | grep -v "identified as fr" | wc -l
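
As a rough sketch of how the quality check could be wrapped in a script, the 
helper below reports both the number of failures and the total number of 
files, so that an error rate can be derived. The function name count_failures 
is hypothetical and not part of Nutch. It assumes the identifier prints one 
line per file, containing "identified as <code>" on success, which is what 
the grep pattern in the command suggests but is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): reads identifier output on stdin
# and prints "<failed> <total>" for the expected language code passed as $1.
# Assumption: one output line per file, containing "identified as <code>"
# when the identification succeeded.
count_failures() {
    awk -v pat="identified as $1" '
        { total++ }                       # each input line counts as one file
        index($0, pat) == 0 { failed++ }  # expected language not found on line
        END { printf "%d %d\n", failed, total }
    '
}

# Possible usage, with the same placeholder path as the command above:
#   java org.apache.nutch.analysis.lang.LanguageIdentifier -identifyfileset \
#     /somewhere/fr/*.txt | count_failures fr
```

Like the grep pipeline, this counts any line without the expected pattern as 
a failure, including blank lines or other diagnostic output.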

Hope this helps. But you are right, a set of scripts would be a good idea.

> Bad language identifier plugin performances
> -------------------------------------------
>
>          Key: NUTCH-60
>          URL: http://issues.apache.org/jira/browse/NUTCH-60
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Reporter: Jerome Charron
>     Priority: Minor
>  Attachments: NUTCH-60-050526.patch, NUTCH-60-050605.patch, 
> NUTCH-60-050607.patch
>
> As reported by Stefan Groschupf 
> (http://www.mail-archive.com/[email protected]/msg04090.html)
> the language identifier plugin consumes a lot of processing time.
> Some optimizations and/or configuration options are required.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
