Re: chinese token overlap bug in org.apache.nutch.summary.basic.BasicSummarizer.getSummary

Julien Nioche Wed, 13 Apr 2011 04:19:33 -0700

Hi,

Nutch has moved away from handling the indexing and search itself and now
delegates that to SOLR as of versions 1.3 and 2.0 (both forthcoming). The
issue you described won't be fixed as this part of the code has been
removed. Users are encouraged to start using 1.3 and use SOLR for the
indexing and search.


Your comments should be useful to anyone having the same issue with Nutch <=
1.2, so thanks for sharing this.

Julien


2011/4/13 Bupo Jung <[email protected]>

> I use Nutch for Chinese search. When I enter a query string such as
> "可爱的小女生" (a lovely little girl), the Chinese analyzer turns it into three
> query tokens: 可爱, 小女, and 女生. When these tokens are used to build the
> summary of a result page, a StringIndexOutOfBoundsException is thrown. Here
> is the error log:
>
> 2010-12-15 12:18:43,505 ERROR searcher.NutchBean - Exception occured while
> executing search: java.lang.RuntimeException:
> java.util.concurrent.ExecutionException:
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>
> at
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:297)
>
> at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:350)
>
> at org.apache.nutch.searcher.NutchBean.main(NutchBean.java:410)
>
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>
> at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>
> at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>
> at
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:292)
>
> … 2 more
>
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
> range: -1
>
> at java.lang.String.substring(String.java:1937)
>
> at
> org.apache.nutch.summary.basic.BasicSummarizer.getSummary(BasicSummarizer.java:188)
>
> at
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:263)
>
> at
> org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:63)
>
> at
> org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:53)
>
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>
> at java.lang.Thread.run(Thread.java:662)
>
> This happens because two of the query tokens, "小女" and "女生", overlap: when
> the second one is processed, offset is already greater than t.startOffset().
>
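
To make the offsets concrete, here is a minimal standalone sketch (not Nutch code). It assumes the analyzer reports character offsets [0,2), [3,5) and [4,6) for the three tokens; the class and method names are hypothetical. After the second token has been highlighted, offset is 5, while the third token starts at 4, so the begin index passed to substring is larger than the end index.

```java
public class OverlapOffsets {

    /** Returns true if text.substring(offset, nextStart) throws. */
    public static boolean substringFails(String text, int offset, int nextStart) {
        try {
            text.substring(offset, nextStart);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            // Thrown whenever offset > nextStart (begin index after end index).
            return true;
        }
    }

    public static void main(String[] args) {
        // The six-character query from the report, written with Unicode
        // escapes so the source file stays ASCII-only.
        String text = "\u53ef\u7231\u7684\u5c0f\u5973\u751f";

        int offset = 5;    // assumed end offset of the second token, [3,5)
        int nextStart = 4; // assumed start offset of the third token, [4,6)

        System.out.println(substringFails(text, offset, nextStart));
    }
}
```

With these assumed offsets the end index minus the begin index is -1, which is consistent with the "String index out of range: -1" message in the log above.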
>
> nutch/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java
>
> line 188:
>
> if (highlight.contains(t.term())) {
>   excerpt.addToken(t.term());
>   // When two tokens overlap, offset > t.startOffset(), so the first
>   // substring call below is where the exception occurs.
>   excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
>   excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
>   offset = t.endOffset();
>   endToken = Math.min(j + sumContext, tokens.length);
> }
>
>
> // Changed code that fixes the error:
> if (highlight.contains(t.term())) {
>   excerpt.addToken(t.term());
>   // bupo changed the code to fix the Chinese token overlap error 2010.12.15
>   if (offset < t.startOffset()) {
>     excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
>     excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
>   } else {
>     excerpt.add(new Highlight(text.substring(offset, t.endOffset())));
>   } // bupo
> }
>
> --
>
> Yizhong Zhuang
> Beijing University of Posts and Telecommunications
> Email:[email protected]
>
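
For anyone adapting the patch above on Nutch <= 1.2, the same idea can be sketched outside Nutch as follows. This is an illustration under stated assumptions, not the BasicSummarizer code: the class name, the int[][] span representation and the [ ] markers are all hypothetical, and the spans are assumed to be sorted by start offset. Besides the overlapping case handled by the patch, the sketch also skips a token that lies entirely inside text that has already been highlighted, since substring(offset, endOffset) would fail there as well.

```java
public class OverlapSafeHighlighter {

    /**
     * Wraps each matched span in [ ] and copies the text in between unchanged.
     * Each span is {startOffset, endOffset}; spans must be sorted by start.
     */
    public static String highlight(String text, int[][] spans) {
        StringBuilder out = new StringBuilder();
        int offset = 0; // end of the text emitted so far
        for (int[] span : spans) {
            int start = span[0];
            int end = span[1];
            if (end <= offset) {
                continue; // span is already fully covered by an earlier highlight
            }
            if (start > offset) {
                out.append(text, offset, start); // plain text between highlights
            }
            // Clamp the start so an overlapping span only adds its uncovered tail.
            out.append('[').append(text, Math.max(start, offset), end).append(']');
            offset = end;
        }
        out.append(text, offset, text.length()); // trailing plain text
        return out.toString();
    }

    public static void main(String[] args) {
        // The query from the report, written with Unicode escapes.
        String text = "\u53ef\u7231\u7684\u5c0f\u5973\u751f";
        // Assumed token offsets: [0,2), [3,5) and the overlapping [4,6).
        int[][] spans = { {0, 2}, {3, 5}, {4, 6} };
        System.out.println(highlight(text, spans));
    }
}
```

With the assumed offsets this prints [可爱]的[小女][生]: the overlapping third token contributes only its last character, which matches the else branch of the patch.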



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
