[jira] Commented: (LUCENE-2668) offset gap should be added regardless of existence of tokens in DocInverterPerField

David Smiley (JIRA) Thu, 30 Sep 2010 06:33:04 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916472#action_12916472
 ]


David Smiley commented on LUCENE-2668:
--------------------------------------

To anyone who has commented on this issue and has worked on this class: do you 
know why it _conditionally_ decrements the position and then later increments 
it unconditionally?  See lines 156 & 188.  Because this happens only some of 
the time, it is thwarting my efforts to make the term positions of multiple 
fields coincide (a special-purpose use case of mine).  I'm using a position 
filter that ensures every term for a value has a position increment of 0, even 
the first one.  But sometimes I have no value, or a value that is a stop word.  
My hacky workaround is to make the first value of each of these multi-valued 
fields a dummy value that gets indexed.  This is ugly and wastes disk space.

> offset gap should be added regardless of existence of tokens in 
> DocInverterPerField
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2668.patch, LUCENE-2668.patch, LUCENE-2668.patch, 
> Test.java
>
>
> Problem: If one value of a multiValued field consists only of a stop word 
> (e.g. "will" in the following sample) and the field is analyzed by 
> StopAnalyzer when indexing, the offsets of the subsequent tokens are not 
> correct.
> {code:title=indexing a multiValued field}
> doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED,
>                     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "will", Store.YES, Index.ANALYZED,
>                     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "use", Store.YES, Index.ANALYZED,
>                     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED,
>                     TermVector.WITH_OFFSETS ) );
> {code}
> In this program (soon to be attached), if you use WhitespaceAnalyzer, the 
> offsets (start,end) for "use" and "Lucene" will be use(10,13) and 
> Lucene(14,20). But if you use StopAnalyzer, the offsets will be use(9,12) and 
> lucene(13,19). When searching, since the searcher cannot know which analyzer 
> was used at indexing time, this problem throws off the alignment of FVH 
> (FastVectorHighlighter).
> Cause of the problem: StopAnalyzer filters out "will", so the anyToken flag is 
> set to false and the offset gap is not added in DocInverterPerField:
> {code:title=DocInverterPerField.java}
> if (anyToken)
>   fieldState.offset += docState.analyzer.getOffsetGap(field);
> {code}
> I don't understand why the condition is there... If the gap were always 
> added, I think things would be simple.
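
The offsets quoted above can be reproduced with a small stand-alone model. This is only a sketch, not Lucene's code: the class `OffsetGapSketch` and its `invert` method are hypothetical names, and it assumes an offset gap of 1, exactly one token per value, and a stop filter that drops the whole value. Unlike StopAnalyzer, it does not lowercase the emitted terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of per-field offset accumulation. NOT Lucene's actual code.
class OffsetGapSketch {

    // Assumption: the analyzer's offset gap between field values is 1.
    static final int OFFSET_GAP = 1;

    // Returns "term(start,end)" for every value that survives the stop filter.
    // alwaysAddGap=false models the conditional "if (anyToken)" from the issue;
    // alwaysAddGap=true models adding the gap unconditionally.
    static List<String> invert(List<String> values, Set<String> stopWords,
                               boolean alwaysAddGap) {
        List<String> out = new ArrayList<>();
        int offset = 0; // plays the role of fieldState.offset
        for (String value : values) {
            boolean anyToken = !stopWords.contains(value.toLowerCase(Locale.ROOT));
            if (anyToken) {
                out.add(value + "(" + offset + "," + (offset + value.length()) + ")");
            }
            // Assumption: the value's length is added to the running offset
            // even when its only token was filtered out.
            offset += value.length();
            if (anyToken || alwaysAddGap) {
                offset += OFFSET_GAP;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("Mike", "will", "use", "Lucene");
        Set<String> none = Collections.<String>emptySet();
        Set<String> stop = Collections.singleton("will");
        System.out.println(invert(values, none, false)); // no stop words
        System.out.println(invert(values, stop, false)); // conditional gap
        System.out.println(invert(values, stop, true));  // gap always added
    }
}
```

With the conditional gap, "use" and "Lucene" shift by one when "will" is filtered out; with the gap always added, they keep the same offsets as in the run without stop words.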

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


