[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ]
Jack Tang commented on NUTCH-36:
--------------------------------
Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs
in Summarizer. Take a Chinese string (c1)(c2)(c3)(c4); the bi-gram tokens for it
are:
matched-image   start-offset   end-offset
(c1)(c2)        0              2
(c2)(c3)        1              3
(c3)(c4)        2              4
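To make the table concrete, here is a minimal sketch that produces the same kind of overlapping bi-grams with their offsets. BigramSketch and Tok are hypothetical names used only for illustration; they are not Nutch or Lucene classes, and the sketch assumes every character fits in a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Nutch or Lucene.
class BigramSketch {
    // A token with its text ("matched image") and offsets; end is exclusive.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Emit every pair of adjacent characters as one token.
    // Assumption: each character is a single Java char (no surrogate pairs).
    static List<Tok> bigrams(String s) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(new Tok(s.substring(i, i + 2), i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four CJK characters yield three tokens: offsets 0-2, 1-3 and 2-4.
        for (Tok t : bigrams("\u4e2d\u6587\u641c\u7d22")) {
            System.out.println(t.text + " " + t.start + " " + t.end);
        }
    }
}
```

Each token shares one character with its neighbour, which is why consecutive highlighted tokens overlap in the summary.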
In search summaries, we should merge tokens whose offsets overlap. To do that,
change this code
    if (highlight.contains(t.termText())) {
      excerpt.addToken(t.termText());
      excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
      excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
      offset = t.endOffset();
      endToken = Math.min(j + SUM_CONTEXT, tokens.length);
    }
to
    if (highlight.contains(t.termText())) {
      // cjk bi-gram: offset is the midpoint of this token, so the token
      // overlaps the previously highlighted one by one character
      if (offset * 2 == (t.startOffset() + t.endOffset())) {
        excerpt.addToken(t.termText().substring(offset - t.startOffset()));
        excerpt.add(new Fragment(text.substring(t.startOffset() + 1, offset)));
        excerpt.add(new Highlight(text.substring(t.startOffset() + 1, t.endOffset())));
      } else {
        excerpt.addToken(t.termText());
        excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
        excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
      }
      offset = t.endOffset();
      endToken = Math.min(j + SUM_CONTEXT, tokens.length);
    }
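For comparison, the same goal can be sketched outside Summarizer by merging overlapping spans before any highlighting is emitted. This is a different technique from the change above (which trims each overlapping token as it is visited), and OverlapMergeSketch, merge and highlight are hypothetical names, not Nutch APIs. It assumes the spans arrive sorted by start offset and uses square brackets as a stand-in for highlight markup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Nutch.
class OverlapMergeSketch {
    // Merge [start, end) spans that overlap. Assumes spans are sorted by start.
    // Spans that merely touch (end == next start) are kept separate.
    static List<int[]> merge(int[][] spans) {
        List<int[]> merged = new ArrayList<>();
        for (int[] s : spans) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && s[0] < last[1]) {
                last[1] = Math.max(last[1], s[1]); // extend the previous span
            } else {
                merged.add(new int[] { s[0], s[1] });
            }
        }
        return merged;
    }

    // Wrap each merged span in [ ] and copy the text between spans unchanged.
    static String highlight(String text, int[][] spans) {
        StringBuilder sb = new StringBuilder();
        int offset = 0;
        for (int[] m : merge(spans)) {
            sb.append(text, offset, m[0]);
            sb.append('[').append(text, m[0], m[1]).append(']');
            offset = m[1];
        }
        sb.append(text.substring(offset));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bi-gram spans 1-3, 2-4 and 3-5 collapse into one highlight, 1-5.
        int[][] spans = { { 1, 3 }, { 2, 4 }, { 3, 5 } };
        System.out.println(highlight("xabcdy", spans)); // prints x[abcd]y
    }
}
```

Merging first keeps the highlighting loop free of special cases, at the cost of a separate pass over the token offsets.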
> Chinese in Nutch
> ----------------
>
> Key: NUTCH-36
> URL: http://issues.apache.org/jira/browse/NUTCH-36
> Project: Nutch
> Type: Improvement
> Components: indexer, searcher
> Environment: all
> Reporter: Jack Tang
> Priority: Minor
> Attachments: 桌
>
> Nutch currently supports Chinese in a very simple way: NutchAnalysis segments
> CJK terms word by word.
> So, if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and
> 'Bar'), the results in the web GUI highlight 'FooBar' as well as 'Foo' and
> 'Bar' on their own, whereas we expect Nutch to highlight only 'FooBar'.
--
This message is automatically generated by JIRA.