Chinese token overlap bug in org.apache.nutch.summary.basic.BasicSummarizer.getSummary

Bupo Jung Wed, 13 Apr 2011 04:13:34 -0700

I use Nutch for Chinese search. When I enter a query string such as
"可爱的小女生" (a lovely little girl), the Chinese analyzer turns it into three
query tokens: 可爱, 小女, and 女生. When these tokens are used to build the
summary of a result page, a StringIndexOutOfBoundsException is thrown. Here is
the error log:


2010-12-15 12:18:43,505 ERROR searcher.NutchBean - Exception occured while executing search: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:297)
    at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:350)
    at org.apache.nutch.searcher.NutchBean.main(NutchBean.java:410)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:292)
    ... 2 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1937)
    at org.apache.nutch.summary.basic.BasicSummarizer.getSummary(BasicSummarizer.java:188)
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:263)
    at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:63)
    at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:53)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

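This happens because the two query tokens “小女” and “女生” overlap: both
contain the character “女”. After “小女” has been highlighted, offset already
lies past the start offset of “女生”, so text.substring(offset, t.startOffset())
is called with a begin index greater than its end index.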

File: nutch/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java

Line 188:

if (highlight.contains(t.term())) {
  excerpt.addToken(t.term());
  // When two tokens overlap, offset > t.startOffset(),
  // so the next line is where the exception occurs.
  excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
  excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  offset = t.endOffset();
  endToken = Math.min(j + sumContext, tokens.length);
}
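
The failure can be reproduced outside Nutch with a few lines of plain Java. This is only a minimal sketch: the class name OverlapRepro, the bracket markup, and the hand-written token offsets are assumptions made for illustration, not Nutch code. It walks the tokens the same way the snippet above does, without any guard.

```java
// Minimal reproduction, independent of Nutch. All names here are hypothetical.
class OverlapRepro {

    /**
     * Mimics the unguarded loop: for each {start, end} token, append the
     * text between the previous token and this one, then the token itself
     * wrapped in brackets.
     */
    static String excerpt(String text, int[][] tokens) {
        StringBuilder out = new StringBuilder();
        int offset = 0;
        for (int[] t : tokens) {
            int start = t[0];
            int end = t[1];
            // Throws StringIndexOutOfBoundsException when offset > start.
            out.append(text.substring(offset, start));
            out.append('[').append(text.substring(start, end)).append(']');
            offset = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "可爱的小女生";
        // Assumed offsets for 可爱, 小女, 女生; the last two share 女.
        int[][] tokens = { {0, 2}, {3, 5}, {4, 6} };
        try {
            System.out.println(excerpt(text, tokens));
        } catch (StringIndexOutOfBoundsException e) {
            // Reached on the third token: substring(5, 4) is invalid.
            System.out.println("overlap caused: " + e);
        }
    }
}
```

With the third token removed, the same method returns normally, which suggests the overlap alone triggers the exception.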


// Changed code that fixes the error:
if (highlight.contains(t.term())) {
  excerpt.addToken(t.term());
  // bupo changed the code to fix the Chinese token overlap error 2010.12.15
  if (offset < t.startOffset()) {
    excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
    excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  } else {
    // Tokens overlap: highlight only the part that has not been emitted yet.
    excerpt.add(new Highlight(text.substring(offset, t.endOffset())));
  } // bupo
  // As in the original code above.
  offset = t.endOffset();
  endToken = Math.min(j + sumContext, tokens.length);
}
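
The same guard can be sketched as standalone Java to check its behavior on the example query. Again, the class name OverlapSafeExcerpt, the bracket markup, and the token offsets are assumptions for illustration only. The sketch also skips a token that lies entirely inside text already emitted; that extra check is not part of the change posted above, but without it substring(offset, end) could still fail when offset > end.

```java
// Standalone sketch of the overlap guard. All names here are hypothetical.
class OverlapSafeExcerpt {

    /** Builds an excerpt from {start, end} tokens, tolerating overlaps. */
    static String excerpt(String text, int[][] tokens) {
        StringBuilder out = new StringBuilder();
        int offset = 0;
        for (int[] t : tokens) {
            int start = t[0];
            int end = t[1];
            if (end <= offset) {
                // Token is fully covered by what was already emitted
                // (an addition beyond the posted fix).
                continue;
            }
            if (offset < start) {
                // Normal case: plain fragment, then the highlighted token.
                out.append(text.substring(offset, start));
                out.append('[').append(text.substring(start, end)).append(']');
            } else {
                // Overlap: highlight only the part not emitted yet.
                out.append('[').append(text.substring(offset, end)).append(']');
            }
            offset = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[][] tokens = { {0, 2}, {3, 5}, {4, 6} };
        // Prints: [可爱]的[小女][生]
        System.out.println(excerpt("可爱的小女生", tokens));
    }
}
```

One consequence of this approach is that the overlapping token is highlighted as two adjacent pieces (小女 and 生) rather than as one merged span; merging adjacent highlights would be a separate design choice.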

--

Yizhong Zhuang
Beijing University of Posts and Telecommunications
Email:[email protected]
