I've been digging into a similar issue and eventually found this Jira ticket:
https://issues.apache.org/jira/browse/LUCENE-2229

So far I haven't received any response on IRC or from the mailing list, and
the bug is resolved as "won't fix" even though there's a patch attached that
attempts to solve the issue.

For now I have given up. I'm assuming that most of the Lucene community just
doesn't use that highlighter anymore. The issue is also difficult to
reproduce, so it probably doesn't cause a problem all that often. It isn't
worth my time right now to dig much deeper.

On Tue, Aug 11, 2015 at 10:38 AM, Duke DAI <duke.dai....@gmail.com> wrote:

> Greetings!
>
> Does anybody have input on this?
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>
> On Fri, Aug 7, 2015 at 10:58 AM, Duke DAI <duke.dai....@gmail.com> wrote:
>
> > Hi experts,
> >
> > I'm trying to reproduce a bug on the Lucene side, and found something.
> >
> > In the latest codeline, 5.2.1, I slightly modified the test case
> > HighlighterTest.testSimpleQueryTermScorerHighlighter as shown below,
> > mainly to use SimpleSpanFragmenter to get only one fragment of
> > length 64.
> >
> > public void testSimpleQueryTermScorerHighlighter() throws Exception {
> >   doSearching(new SpanTermQuery(new Term(FIELD_NAME, "cats")));
> >   QueryScorer queryScorer = new QueryScorer(query, FIELD_NAME);
> >   Highlighter highlighter = new Highlighter(queryScorer);
> >   // Highlighter highlighter = new Highlighter(new QueryTermScorer(query));
> >   highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, 64));
> >   int maxNumFragmentsRequired = 1; // only need one fragment
> >   for (int i = 0; i < hits.totalHits; i++) {
> >     final int docId = hits.scoreDocs[i].doc;
> >     final Document doc = searcher.doc(docId);
> >     String text = doc.get(FIELD_NAME);
> >     TokenStream tokenStream = getAnyTokenStream(FIELD_NAME, docId);
> >
> >     String result = highlighter.getBestFragments(tokenStream, text,
> >         maxNumFragmentsRequired, "...");
> >     if (true) System.out.println("\t" + result);
> >   }
> >   // Not sure we can assert anything here - just running to check we don't
> >   // throw any exceptions
> > }
> >
> > With two documents:
> > 1. "The word content does not contain the stem that we are looking for
> >    but the metadata cats does. Do you think fragmenter work well? Do you
> >    think fragmenter work well?"
> > 2. "The word content does not contain the stem that we are looking for
> >    but the metadata cats does. "
> >
> > Got the corresponding fragments:
> > 1. "for but the metadata <B>cats</B> does. Do you think fragmenter work",
> >    no problem, it's exactly what I expected.
> > 2. "The word content does not contain the stem that we are looking for
> >    but the metadata <B>cats</B> does. ", apparently the length is more
> >    than 64. That's the problem reported by my colleague.
> >
> > More specifically, the problem is caused by the code snippet below in
> > SimpleSpanFragmenter.isNewFragment:
> >
> >   boolean isNewFrag = offsetAtt.endOffset() >= (fragmentSize * currentNumFrags)
> >       && (textSize - offsetAtt.endOffset()) >= (fragmentSize >>> 1);
> >
> > Near the end of the text the fragmenter never starts a new fragment, and
> > the logic that follows does not trim the oversized fragment either.
> >
> > Is it possible to handle this corner case in the standard highlighter
> > code?
> >
> > Best regards,
> > Duke
> > If not now, when? If not me, who?
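
For what it's worth, the arithmetic behind that condition can be checked in isolation. The following is only a minimal sketch, not Lucene code: the class name FragmentBoundarySketch and the method countBreaks are hypothetical, tokens are split on whitespace instead of coming from a real analyzer, and I am assuming the fragment counter starts at 1 and is incremented on every break. It copies just the boolean expression quoted above and ignores the span-position handling in the real fragmenter.

```java
// Hypothetical stand-alone sketch; not part of Lucene.
class FragmentBoundarySketch {

    // The boolean expression quoted from SimpleSpanFragmenter.isNewFragment,
    // with the attribute lookups replaced by plain ints.
    static boolean isNewFragment(int endOffset, int textSize,
                                 int fragmentSize, int currentNumFrags) {
        return endOffset >= fragmentSize * currentNumFrags
            && (textSize - endOffset) >= (fragmentSize >>> 1);
    }

    // Counts how many fragment breaks the condition produces for a text.
    // Assumption: whitespace tokenization, counter starts at 1 and is
    // incremented whenever a break is reported.
    static int countBreaks(String text, int fragmentSize) {
        int currentNumFrags = 1;
        int breaks = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            // Advance to the end of the token; i is then its end offset.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (isNewFragment(i, text.length(), fragmentSize, currentNumFrags)) {
                breaks++;
                currentNumFrags++;
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        String shortDoc = "The word content does not contain the stem that we are "
            + "looking for but the metadata cats does. ";
        String longDoc = shortDoc
            + "Do you think fragmenter work well? Do you think fragmenter work well?";
        System.out.println(shortDoc.length() + " chars, breaks: "
            + countBreaks(shortDoc, 64));
        System.out.println(longDoc.length() + " chars, breaks: "
            + countBreaks(longDoc, 64));
    }
}
```

With fragmentSize = 64, the first clause needs endOffset >= 64 while the second needs endOffset <= textSize - 32. For any text shorter than 96 characters those two ranges cannot overlap, so under these assumptions no break is ever reported and the whole text comes back as a single fragment. That would match document 2 above, and it suggests the trimming would have to happen somewhere after the fragmenter.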