[ http://issues.apache.org/jira/browse/NUTCH-60?page=all ]
Jerome Charron updated NUTCH-60:
--------------------------------
Attachment: NUTCH-60-050526.patch
Patch with some minor performance improvements, plus some configuration
parameters that can be tuned to improve performance further.
See http://wiki.apache.org/nutch/LanguageIdentifierBenchs and
http://wiki.apache.org/nutch/NewLanguageIdentifier (coming soon) for more
details.
In short, it adds the following configuration parameters:
* lang.ngram.min.length: The minimum size of the ngrams used to identify the
language (must be between 1 and lang.ngram.max.length). The larger the range
between lang.ngram.min.length and lang.ngram.max.length, the better the
identification, but the slower it is.
* lang.ngram.max.length: The maximum size of the ngrams used to identify the
language (must be between lang.ngram.min.length and 4). The larger the range
between lang.ngram.min.length and lang.ngram.max.length, the better the
identification, but the slower it is.
* lang.analyze.max.length: The maximum number of bytes of data used to identify
the language (0 means full content analysis). The larger this value, the better
the analysis, but the slower it is.
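
To illustrate how the three parameters interact, here is a minimal sketch in
Java. It is not the code from the patch: the class name NgramRangeSketch and
the method extractNgrams are hypothetical, and for simplicity the sketch
truncates by characters, whereas lang.analyze.max.length is expressed in bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the language identifier plugin's code.
class NgramRangeSketch {

    /**
     * Collects every ngram of text whose size lies in [minLength, maxLength].
     * analyzeMaxLength limits how much of the text is looked at (0 = all of it).
     * Assumption: the limit is applied to characters here, not bytes.
     */
    static List<String> extractNgrams(String text, int minLength,
                                      int maxLength, int analyzeMaxLength) {
        // Constraints as stated for the parameters: 1 <= min <= max <= 4.
        if (minLength < 1 || maxLength < minLength || maxLength > 4) {
            throw new IllegalArgumentException(
                "expected 1 <= minLength <= maxLength <= 4");
        }
        // Keep only the leading part of the content when a limit is set.
        String sample = text;
        if (analyzeMaxLength > 0 && text.length() > analyzeMaxLength) {
            sample = text.substring(0, analyzeMaxLength);
        }
        List<String> ngrams = new ArrayList<>();
        // A wider [minLength, maxLength] range means more passes and more ngrams.
        for (int n = minLength; n <= maxLength; n++) {
            for (int i = 0; i + n <= sample.length(); i++) {
                ngrams.add(sample.substring(i, i + n));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // Full content, ngram sizes 1 to 2.
        System.out.println(extractNgrams("nutch", 1, 2, 0));
        // Same range, but only the first 3 characters are analyzed.
        System.out.println(extractNgrams("nutch", 1, 2, 3));
    }
}
```

In this sketch, both a wider ngram range and a larger analysis limit increase
the number of ngrams to process, which matches the quality versus speed
trade-off described for each parameter.
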
Some new ngram profiles have been generated for en, es, fr, nl, it, pt, da, sv,
de, fi and el, because the new implementation needs more ngrams in the profile,
but it is backward compatible with the old ones.
Some unit tests have been added.
> Bad language identifier plugin performances
> -------------------------------------------
>
> Key: NUTCH-60
> URL: http://issues.apache.org/jira/browse/NUTCH-60
> Project: Nutch
> Type: Improvement
> Components: indexer
> Reporter: Jerome Charron
> Priority: Minor
> Attachments: NUTCH-60-050526.patch
>
> As reported by Stefan Groschupf
> (http://www.mail-archive.com/[email protected]/msg04090.html)
> the language identifier plugin consumes a lot of processing time.
> Some optimizations and/or configuration options are required.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira