[ http://issues.apache.org/jira/browse/NUTCH-60?page=comments#action_12313323 ]
Jerome Charron commented on NUTCH-60:
-------------------------------------

Sami,

* To measure speed, I simply uncomment the lines marked "used for benchs" in the main method of LanguageIdentifier. Then I launch the TestIdentifier on a large set of files using the fileset command line argument.

* To measure identification quality, I configure the language identifier plugin with the desired size of data to analyze, comment out again the lines that were uncommented for the speed test, and run the command line with the fileset argument on a large set of documents that are all in the same language, piping the output through grep and wc to get the number of failed identifications:

    java org.apache.nutch.analysis.lang.LanguageIdentifier -identifyfileset /somewhere/fr/*.txt | grep -v "identified as fr" | wc -l

Hope this helps. But you are right, a set of scripts would be a good idea.

> Bad language identifier plugin performances
> -------------------------------------------
>
>          Key: NUTCH-60
>          URL: http://issues.apache.org/jira/browse/NUTCH-60
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Reporter: Jerome Charron
>     Priority: Minor
>  Attachments: NUTCH-60-050526.patch, NUTCH-60-050605.patch, NUTCH-60-050607.patch
>
> As reported by Stefan Groschupf
> (http://www.mail-archive.com/[email protected]/msg04090.html)
> the language identifier plugin consumes a lot of processing time.
> Some optimizations and/or configuration options are required.
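As a starting point for such a script, the quality check described in the comment could be wrapped in a small shell function. This is only a sketch: the name failure_report is hypothetical, and it assumes the identifier prints one line per file and that a correct result contains the text "identified as <lang>", as the grep in the comment implies. It reports the total as well as the failures, so the error rate can be read off directly.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch.
# Reads identifier output on stdin and reports how many lines do NOT
# contain "identified as <lang>".
# Assumption: one output line per input file.
failure_report() {
    awk -v lang="$1" '
        # index() returns 0 when the expected text is absent from the line
        index($0, "identified as " lang) == 0 { failed++ }
        END { printf "%d failed out of %d\n", failed, NR }
    '
}
```

It would be used in place of the grep and wc stages of the command above, for example: java org.apache.nutch.analysis.lang.LanguageIdentifier -identifyfileset /somewhere/fr/*.txt | failure_report fr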
