[ http://issues.apache.org/jira/browse/LUCENE-627?page=comments#action_12434087 ]

Kerang Lv commented on LUCENE-627:
----------------------------------
Hi Yonik,

I'm trying to add support for overlapping bigram analyzers, e.g. the CJKAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/), on top of your patch. With your patch, the following test fails with:

Expected :一<B>二三</B>四五<B>六七</B>八九十
Actual   :一<B>二三四五六七</B>

    public void testOverlapAnalyzer4() throws Exception {
        String s = "一二三四五六七八九十";

        // the token stream for the string above:
        TokenStream ts = new TokenStream() {
            Iterator iter;
            {
                List lst = new ArrayList();
                lst.add(new Token("一二", 0, 2));
                lst.add(new Token("二三", 1, 3));
                lst.add(new Token("三四", 2, 4));
                lst.add(new Token("四五", 3, 5));
                lst.add(new Token("五六", 4, 6));
                lst.add(new Token("六七", 5, 7));
                lst.add(new Token("七八", 6, 8));
                lst.add(new Token("八九", 7, 9));
                lst.add(new Token("九十", 8, 10));
                iter = lst.iterator();
            }

            public Token next() throws IOException {
                return iter.hasNext() ? (Token) iter.next() : null;
            }
        };

        String srchkey = "二三 六七";
        QueryParser parser = new QueryParser("text", new WhitespaceAnalyzer());
        Query query = parser.parse(srchkey);
        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(ts, s, 3, "...");
        String expectedResult = "一<B>二三</B>四五<B>六七</B>八九十";
        assertEquals(expectedResult, result);
    }

With an overlapping bigram analyzer, each token's startOffset is the previous token's endOffset - 1, so TokenGroup.isDistinct(token) returns false most of the time, which leads to a bad range for tokenText.

Here is a patch that makes the tests work.
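To illustrate the offset arithmetic, here is a minimal standalone sketch. It does not use the Lucene classes: Span, groupByOverlap and bigrams are hypothetical names, and it assumes that a token counts as "distinct" only when its start offset is at or past the end offset of the current group. Under that assumption, a chain of overlapping bigrams never starts a new group, so the whole string collapses into one range. This is not meant to reproduce the exact highlighter output above.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapGroupingSketch {

    /** Hypothetical stand-in for a token's character offsets (end is exclusive). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /**
     * Merges tokens into groups. ASSUMPTION: a token is "distinct" (starts a
     * new group) only if its start offset is >= the current group's end offset.
     */
    static List<Span> groupByOverlap(List<Span> tokens) {
        List<Span> groups = new ArrayList<>();
        int groupStart = -1;
        int groupEnd = -1;
        for (Span t : tokens) {
            boolean distinct = groupEnd < 0 || t.start >= groupEnd;
            if (distinct) {
                if (groupEnd >= 0) {
                    groups.add(new Span(groupStart, groupEnd)); // close previous group
                }
                groupStart = t.start;
                groupEnd = t.end;
            } else {
                groupEnd = Math.max(groupEnd, t.end); // overlapping token extends the group
            }
        }
        if (groupEnd >= 0) {
            groups.add(new Span(groupStart, groupEnd));
        }
        return groups;
    }

    /** Overlapping bigram offsets for a string of the given length: (0,2), (1,3), ... */
    static List<Span> bigrams(int length) {
        List<Span> out = new ArrayList<>();
        for (int i = 0; i + 2 <= length; i++) {
            out.add(new Span(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Every bigram starts one character before the previous one ends,
        // so all nine tokens end up in a single group.
        System.out.println(groupByOverlap(bigrams(10))); // prints [[0,10)]
    }
}
```

With non-overlapping tokens, e.g. offsets (0,2) and (3,5), the same grouping yields two separate groups, which is why the problem only shows up with analyzers that emit overlapping tokens.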
> highlighter problems with overlapping tokens
> --------------------------------------------
>
>                 Key: LUCENE-627
>                 URL: http://issues.apache.org/jira/browse/LUCENE-627
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Other
>    Affects Versions: 2.0.1
>            Reporter: Yonik Seeley
>             Fix For: 2.0.1
>
>         Attachments: highlight_overlap.diff
>
>
> The lucene highlighter has problems when tokens that overlap are generated.
> For example, if analysis of iPod generates the tokens "i", "pod", "ipod"
> (with pod and ipod in the same position),
> then the highlighter will output this as iipod, regardless of if any of those
> tokens are highlighted.
> Discovered via http://issues.apache.org/jira/browse/SOLR-24