Gustavo Rauber created NUTCH-1547:
-------------------------------------

             Summary: BasicIndexingFilter - Problem to index full title
                 Key: NUTCH-1547
                 URL: https://issues.apache.org/jira/browse/NUTCH-1547
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.6
            Reporter: Gustavo Rauber
            Priority: Minor


I ran into this issue when trying to index the entire title, as can be done 
for the content, by setting indexer.max.title.length to -1 in 
nutch-default.xml. I think the title should behave the same way as the 
content, with -1 meaning no truncation.

To fix it, replace line 90:

if (title.length() > MAX_TITLE_LENGTH) {      // truncate title if needed

with this one:

if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) {      // truncate title if needed
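
The proposed guard can be tried in isolation with a minimal sketch like the 
one below. It is not the actual BasicIndexingFilter code: the class name 
TitleTruncationSketch and the method truncateTitle are hypothetical, and the 
maxTitleLength parameter stands in for MAX_TITLE_LENGTH. Without the first 
condition, a limit of -1 makes title.length() > -1 true for every title, so 
substring(0, -1) is called and throws the exception shown in the stack trace.

```java
// Hypothetical stand-alone sketch, not the real Nutch class.
class TitleTruncationSketch {

    // maxTitleLength stands in for MAX_TITLE_LENGTH; a negative value
    // is assumed to mean "do not truncate", as proposed in this issue.
    static String truncateTitle(String title, int maxTitleLength) {
        if (maxTitleLength > -1 && title.length() > maxTitleLength) {
            // truncate title if needed
            title = title.substring(0, maxTitleLength);
        }
        return title;
    }

    public static void main(String[] args) {
        // With a positive limit the title is cut to that many characters.
        System.out.println(truncateTitle("A fairly long page title", 8));
        // With -1 the title is returned unchanged instead of throwing.
        System.out.println(truncateTitle("A fairly long page title", -1));
    }
}
```

With the guard in place, a limit of -1 returns the title untouched, while any 
non-negative limit still truncates as before.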


Stack Trace:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1937)
        at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)


Cheers.

