[
https://issues.apache.org/jira/browse/LUCENE-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916472#action_12916472
]
David Smiley commented on LUCENE-2668:
--------------------------------------
A question for anyone who has commented on this issue and has worked on this
class: do you know why it _conditionally_ decrements the position and then
increments it later unconditionally? See lines 156 & 188. Because the decrement
happens only some of the time, it is thwarting my efforts to make the term
positions of multiple fields coincide (a special-purpose use case of mine).
I'm using a position filter that ensures that all terms for a value have 0
position increment gap, even the first one. But sometimes I have no value, or I
have a value that is a stop word. My hacky workaround is to make the first
value of each of these multi-valued fields a dummy value that gets indexed.
This is ugly and wastes disk space.
> offset gap should be added regardless of existence of tokens in
> DocInverterPerField
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-2668
> URL: https://issues.apache.org/jira/browse/LUCENE-2668
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2668.patch, LUCENE-2668.patch, LUCENE-2668.patch,
> Test.java
>
>
> Problem: If a multiValued field has a value that consists only of a stop
> word (e.g. "will" in the following sample) and the field is analyzed by
> StopAnalyzer when indexing, the offsets of the subsequent tokens are not
> correct.
> {code:title=indexing a multiValued field}
> doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "will", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "use", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> {code}
> In this program (soon to be attached), if you use WhitespaceAnalyzer, the
> offsets (start,end) for "use" and "Lucene" will be use(10,13) and
> Lucene(14,20). But if you use StopAnalyzer, the offsets will be use(9,12) and
> lucene(13,19). Since the searcher cannot know which analyzer was used at
> indexing time, this problem causes FVH to be out of alignment when searching.
> Cause of the problem: StopAnalyzer filters out "will", so the anyToken flag
> is set to false, and then the offset gap is not added in DocInverterPerField:
> {code:title=DocInverterPerField.java}
> if (anyToken)
>   fieldState.offset += docState.analyzer.getOffsetGap(field);
> {code}
> I don't understand why the condition is there... If the gap were always
> added, I think things would be simpler.
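
The offsets quoted in the description can be reproduced with a small
stand-alone model of the bookkeeping. This is only a sketch and not the
actual DocInverterPerField code: it uses no Lucene classes, treats each field
value as a single token, assumes an offset gap of 1, and skips the
lowercasing that StopAnalyzer performs. The class name OffsetGapSketch, the
invert method and the alwaysAddGap flag are hypothetical names introduced for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Simplified model of per-field offset accumulation across multiple values.
// Assumption: one token per value, and an offset gap of 1 between values.
class OffsetGapSketch {
    static final int OFFSET_GAP = 1; // assumed value of getOffsetGap(field)

    // Returns "term(start,end)" for every value that survives stop-word removal.
    static List<String> invert(List<String> values, Set<String> stopWords,
                               boolean alwaysAddGap) {
        List<String> tokens = new ArrayList<String>();
        int offset = 0; // plays the role of fieldState.offset
        for (String value : values) {
            boolean anyToken = false;
            if (!stopWords.contains(value)) {
                tokens.add(value + "(" + offset + "," + (offset + value.length()) + ")");
                anyToken = true;
            }
            // The length of the value is always consumed...
            offset += value.length();
            // ...but the gap is only added when a token was produced,
            // unless alwaysAddGap models the change proposed in the issue.
            if (anyToken || alwaysAddGap) {
                offset += OFFSET_GAP;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("Mike", "will", "use", "Lucene");
        Set<String> noStops = Collections.<String>emptySet();
        Set<String> stops = Collections.singleton("will");

        System.out.println(invert(values, noStops, false)); // whitespace-style analysis
        System.out.println(invert(values, stops, false));   // stop word, conditional gap
        System.out.println(invert(values, stops, true));    // stop word, gap always added
    }
}
```

With the conditional gap, the model places "use" at (9,12) and "Lucene" at
(13,19) once "will" is filtered out. With the gap always added, they stay at
(10,13) and (14,20), the same as when no stop words are removed.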