I use Nutch for Chinese search. When I enter a query string such as
"可爱的小女生" (a lovely little girl), the Chinese analyzer turns it into three
query tokens: 可爱, 小女, and 女生. When Nutch uses these tokens to build the
summary for the result page, a StringIndexOutOfBoundsException is thrown.
Here is the error log:
2010-12-15 12:18:43,505 ERROR searcher.NutchBean - Exception occured while executing search: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:297)
    at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:350)
    at org.apache.nutch.searcher.NutchBean.main(NutchBean.java:410)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:292)
    ... 2 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1937)
    at org.apache.nutch.summary.basic.BasicSummarizer.getSummary(BasicSummarizer.java:188)
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:263)
    at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:63)
    at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:53)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
This happens because two of the query tokens, "小女" and "女生", overlap.
The failing code is at line 188 of
nutch/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java:
if (highlight.contains(t.term())) {
  excerpt.addToken(t.term());
  // When two tokens overlap, offset > t.startOffset(), so the first
  // substring call below is where the exception occurs.
  excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
  excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  offset = t.endOffset();
  endToken = Math.min(j + sumContext, tokens.length);
}
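To see the failure in isolation, here is a minimal standalone sketch that mimics only the offset bookkeeping of the loop above. The Tok class and the character offsets are assumptions made for illustration (they are not taken from the analyzer output), and the query string itself stands in for the page text.

```java
public class OverlapRepro {
    /** Hypothetical stand-in for an analyzer token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Replays the unguarded substring(offset, startOffset) call for each token.
     * Returns the index of the first token for which it throws, or -1 if none does.
     */
    static int firstFailingToken(String text, Tok[] tokens) {
        int offset = 0;
        for (int i = 0; i < tokens.length; i++) {
            Tok t = tokens[i];
            try {
                text.substring(offset, t.start); // the fragment before the match
            } catch (StringIndexOutOfBoundsException e) {
                return i;
            }
            offset = t.end;
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "可爱的小女生";
        // Assumed offsets: 小女 covers [3, 5) and 女生 covers [4, 6), so they overlap.
        Tok[] tokens = {
            new Tok("可爱", 0, 2),
            new Tok("小女", 3, 5),
            new Tok("女生", 4, 6),
        };
        // After 小女, offset is 5; the next call is substring(5, 4), which throws.
        System.out.println(firstFailingToken(text, tokens)); // prints 2
    }
}
```

With these assumed offsets the failing call is substring(5, 4). The end index minus the begin index is -1, which would be consistent with the "String index out of range: -1" message in the log.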
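I changed the code as follows to fix the error:
if (highlight.contains(t.term())) {
  excerpt.addToken(t.term());
  // bupo changed the code to fix the Chinese token overlap error, 2010.12.15
  if (offset < t.startOffset()) {
    // No overlap: add the plain text before the match, then the match itself.
    excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
    excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  } else {
    // Overlap: highlight only the part not already covered by the previous token.
    excerpt.add(new Highlight(text.substring(offset, t.endOffset())));
  } // bupo
  // The two lines below are unchanged from the original code.
  offset = t.endOffset();
  endToken = Math.min(j + sumContext, tokens.length);
}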
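The same guard can be tried outside Nutch. The sketch below is not the Nutch code: OverlapSafeExcerpt, build, and the bracket notation for highlights are hypothetical names chosen for illustration, and the token offsets are assumed. It also adds one check that is not in the patch above: a token that ends at or before the current offset is skipped, since substring(offset, endOffset) would otherwise be empty or throw for a token fully covered by an earlier one.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSafeExcerpt {
    /**
     * Splits text into plain fragments and highlighted pieces (wrapped in
     * brackets), given token spans as {start, end} pairs sorted by start offset.
     */
    static List<String> build(String text, int[][] spans) {
        List<String> parts = new ArrayList<>();
        int offset = 0;
        for (int[] span : spans) {
            int start = span[0];
            int end = span[1];
            if (end <= offset) {
                continue; // extra guard (assumption): token already fully highlighted
            }
            if (offset < start) {
                // No overlap: plain text before the match, then the whole match.
                parts.add(text.substring(offset, start));
                parts.add("[" + text.substring(start, end) + "]");
            } else {
                // Overlap: highlight only the part past the previous token's end.
                parts.add("[" + text.substring(offset, end) + "]");
            }
            offset = end;
        }
        if (offset < text.length()) {
            parts.add(text.substring(offset)); // trailing plain text
        }
        return parts;
    }

    public static void main(String[] args) {
        int[][] spans = { {0, 2}, {3, 5}, {4, 6} }; // assumed offsets for 可爱, 小女, 女生
        System.out.println(String.join("", build("可爱的小女生", spans)));
        // prints [可爱]的[小女][生]
    }
}
```

Note that the overlapping token 女生 contributes only 生 to the excerpt, because 女 was already highlighted as part of 小女.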
--
Yizhong Zhuang
Beijing University of Posts and Telecommunications