offset gap should be added regardless of existence of tokens in
DocInverterPerField
-----------------------------------------------------------------------------------
Key: LUCENE-2668
URL: https://issues.apache.org/jira/browse/LUCENE-2668
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 3.0.2, 2.9.3, 3.1, 4.0
Reporter: Koji Sekiguchi
Priority: Minor
Problem: If one value of a multiValued field consists only of a stop word
(e.g. "will" in the following sample) and the field is analyzed by
StopAnalyzer when indexing, the offsets of the subsequent tokens are not
correct.
{code:title=indexing a multiValued field}
doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
doc.add( new Field( F, "will", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
doc.add( new Field( F, "use", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED,
    TermVector.WITH_OFFSETS ) );
{code}
In this program (soon to be attached), if you use WhitespaceAnalyzer, the
offsets (start,end) for "use" and "Lucene" are use(10,13) and Lucene(14,20).
But if you use StopAnalyzer, the offsets are use(9,12) and lucene(13,19).
Since the searcher cannot know which analyzer was used at indexing time, this
discrepancy causes FVH (FastVectorHighlighter) highlights to be misaligned.
Cause of the problem: StopAnalyzer filters out "will", so the anyToken flag
stays false for that value and the offset gap is not added in
DocInverterPerField:
{code:title=DocInverterPerField.java}
if (anyToken)
  fieldState.offset += docState.analyzer.getOffsetGap(field);
{code}
I don't understand why the condition is there... If the gap were always
added, I think things would be simpler.
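
To make the arithmetic concrete, here is a minimal stand-alone sketch that
models only the offset bookkeeping described above. It does not use Lucene at
all: the class OffsetGapSketch, the Tok type, the invert method and the
alwaysAddGap switch are hypothetical names invented for illustration. It
assumes an offset gap of 1, one word per field value, and that the end offset
of a value is still added when its only token is filtered out (which is what
the offsets reported above imply). Lowercasing by StopAnalyzer is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, Lucene-free model of the offset accumulation across the
// values of a multiValued field. Not the actual DocInverterPerField code.
final class OffsetGapSketch {

    /** Simplified stand-in for an indexed token: term text plus offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "(" + start + "," + end + ")";
        }
    }

    /**
     * Walks the field values in order and assigns offsets.
     * alwaysAddGap = false mimics the current "if (anyToken)" condition;
     * alwaysAddGap = true mimics adding the gap unconditionally.
     */
    static List<Tok> invert(String[] values, Set<String> stopWords,
                            int offsetGap, boolean alwaysAddGap) {
        List<Tok> tokens = new ArrayList<Tok>();
        int offset = 0; // plays the role of fieldState.offset
        for (String value : values) {
            boolean anyToken = false;
            // Assumption: each value is a single word, so at most one token.
            if (!stopWords.contains(value.toLowerCase())) {
                tokens.add(new Tok(value, offset, offset + value.length()));
                anyToken = true;
            }
            // Assumption: the value's end offset is added even when the
            // value produced no tokens.
            offset += value.length();
            if (alwaysAddGap || anyToken) {
                offset += offsetGap;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] values = {"Mike", "will", "use", "Lucene"};
        Set<String> noStops = new HashSet<String>();
        Set<String> stops = new HashSet<String>(Arrays.asList("will"));

        System.out.println("no stop words:        " + invert(values, noStops, 1, false));
        System.out.println("stop word, if(any):   " + invert(values, stops, 1, false));
        System.out.println("stop word, always:    " + invert(values, stops, 1, true));
    }
}
```

Under these assumptions, the second call reproduces the shifted offsets
use(9,12) and Lucene(13,19), while the third call, which adds the gap
unconditionally, yields the same use(10,13) and Lucene(14,20) as the run
without stop words.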