[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ]
Jack Tang commented on NUTCH-36:
--------------------------------

Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer.
Take a Chinese string (c1)(c2)(c3)(c4). The result of bi-gram tokenization is:

    matched-image   start-offset   end-offset
    (c1)(c2)        0              2
    (c2)(c3)        1              3
    (c3)(c4)        2              4

In search summaries, we should merge tokens whose offsets overlap. One way to do this is to change the code

    if (highlight.contains(t.termText())) {
      excerpt.addToken(t.termText());
      excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
      excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
      offset = t.endOffset();
      endToken = Math.min(j + SUM_CONTEXT, tokens.length);
    }

to

    if (highlight.contains(t.termText())) {
      if (offset * 2 == (t.startOffset() + t.endOffset())) { // cjk bi-gram
        // The previous highlight ended in the middle of this two-character
        // token, so only its second character is new.
        excerpt.addToken(t.termText().substring(offset - t.startOffset()));
        // This fragment is empty here, since offset == t.startOffset() + 1.
        excerpt.add(new Fragment(text.substring(t.startOffset() + 1, offset)));
        excerpt.add(new Highlight(text.substring(t.startOffset() + 1, t.endOffset())));
      } else {
        excerpt.addToken(t.termText());
        excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
        excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
      }
      offset = t.endOffset();
      endToken = Math.min(j + SUM_CONTEXT, tokens.length);
    }

> Chinese in Nutch
> ----------------
>
>          Key: NUTCH-36
>          URL: http://issues.apache.org/jira/browse/NUTCH-36
>      Project: Nutch
>         Type: Improvement
>   Components: indexer, searcher
>  Environment: all
>     Reporter: Jack Tang
>     Priority: Minor
>  Attachments: 桌
>
> Nutch now supports Chinese in a very simple way: NutchAnalysis segments CJK terms
> word-by-word.
> So, if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and 'Bar'),
> the result in the web GUI will highlight 'FooBar' as well as 'Foo' and 'Bar', while we
> expect Nutch to highlight only 'FooBar'.

--
This message is automatically generated by JIRA.
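To make the offsets in the comment above concrete, here is a minimal standalone sketch of the merging idea. It is only an illustration under assumptions: the class and method names (BigramHighlightSketch, bigrams, mergeOverlapping, highlights) are hypothetical and not part of Nutch, it does not use the Summarizer or Token classes, and it assumes every character is a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Nutch.
final class BigramHighlightSketch {

    /** Returns {start, end} offsets (end exclusive) of every 2-character window. */
    static List<int[]> bigrams(String text) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            spans.add(new int[] {i, i + 2});
        }
        return spans;
    }

    /** Merges spans that overlap; the input must be sorted by start offset. */
    static List<int[]> mergeOverlapping(List<int[]> spans) {
        List<int[]> merged = new ArrayList<>();
        for (int[] span : spans) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] < last[1]) {
                // Overlap: extend the previous highlight instead of starting a new one.
                last[1] = Math.max(last[1], span[1]);
            } else {
                merged.add(new int[] {span[0], span[1]});
            }
        }
        return merged;
    }

    /** Cuts the highlighted substrings out of the text. */
    static List<String> highlights(String text, List<int[]> spans) {
        List<String> out = new ArrayList<>();
        for (int[] s : spans) {
            out.add(text.substring(s[0], s[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four CJK characters standing in for (c1)(c2)(c3)(c4).
        String text = "\u4e2d\u6587\u641c\u7d22";
        List<int[]> grams = bigrams(text);
        for (int[] g : grams) {
            System.out.println(text.substring(g[0], g[1]) + " " + g[0] + " " + g[1]);
        }
        List<int[]> merged = mergeOverlapping(grams);
        System.out.println(merged.size() + " merged span(s): " + highlights(text, merged));
    }
}
```

Merging on offsets rather than on term text keeps the sketch independent of how the analyzer splits the string. The patch in the comment reaches a similar result incrementally, by emitting only the not-yet-highlighted tail of each overlapping bi-gram.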
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers