[jira] Updated: (LUCENE-2668) offset gap should be added regardless of existence of tokens in DocInverterPerField

Koji Sekiguchi (JIRA) Sat, 25 Sep 2010 18:24:56 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Koji Sekiguchi updated LUCENE-2668:
-----------------------------------

    Attachment: LUCENE-2668.patch

Here is my idea. It is very simple: always add the offset gap. With this patch, 
the offset test in TestIndexWriter fails.
Can anyone explain why the condition (if(anyToken)) is there?

> offset gap should be added regardless of existence of tokens in 
> DocInverterPerField
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-2668.patch, Test.java
>
>
> Problem: If one value of a multiValued field consists only of a stop word 
> (e.g. "will" in the following sample) and the field is analyzed by 
> StopAnalyzer when indexing, the offsets of the subsequent tokens are not correct.
> {code:title=indexing a multiValued field}
> doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "will", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "use", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> {code}
> In this program (soon to be attached), if you use WhitespaceAnalyzer, the 
> offsets (start,end) for "use" and "Lucene" will be use(10,13) and 
> Lucene(14,20). But if you use StopAnalyzer, the offsets will be use(9,12) and 
> lucene(13,19). Since the searcher cannot know which analyzer was used at 
> indexing time, this problem causes FVH (FastVectorHighlighter) highlights 
> to be misaligned.
> Cause of the problem: StopAnalyzer filters out "will", so the anyToken flag 
> is false and the offset gap is not added in DocInverterPerField:
> {code:title=DocInverterPerField.java}
> if (anyToken)
>   fieldState.offset += docState.analyzer.getOffsetGap(field);
> {code}
> I don't understand why the condition is there... If the gap were always 
> added, I think things would be simpler.
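
The offset arithmetic in the quoted report can be reproduced without Lucene. 
The following is a minimal Java sketch, not the DocInverterPerField code 
itself. The class OffsetGapSketch, the method offsets, and the alwaysAddGap 
flag are hypothetical names. It assumes each field value is a single word and 
that the offset gap is 1, which is what the reported offsets imply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OffsetGapSketch {
    // Assumed gap between values of a multiValued field (implied by the reported offsets).
    static final int OFFSET_GAP = 1;

    /**
     * Returns "term(start,end)" for every value that survives stop-word removal.
     * alwaysAddGap = false mimics the current "if (anyToken)" condition;
     * alwaysAddGap = true mimics the proposed change.
     */
    static List<String> offsets(String[] values, Set<String> stopWords, boolean alwaysAddGap) {
        List<String> out = new ArrayList<String>();
        int offset = 0; // plays the role of fieldState.offset
        for (String value : values) {
            boolean anyToken = false;
            // Each value is one word here, so it produces at most one token.
            if (!stopWords.contains(value.toLowerCase())) {
                out.add(value + "(" + offset + "," + (offset + value.length()) + ")");
                anyToken = true;
            }
            // The end offset of the value is added even if every token was filtered out.
            offset += value.length();
            if (alwaysAddGap || anyToken) {
                offset += OFFSET_GAP;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {"Mike", "will", "use", "Lucene"};
        Set<String> none = Collections.<String>emptySet();
        Set<String> stop = new HashSet<String>(Arrays.asList("will"));
        System.out.println("no stop words:        " + offsets(values, none, false));
        System.out.println("stop word, condition: " + offsets(values, stop, false));
        System.out.println("stop word, always:    " + offsets(values, stop, true));
    }
}
```

With "will" as the only stop word and the condition in place, the sketch 
yields use(9,12) and Lucene(13,19), as in the report. With the gap always 
added, it yields use(10,13) and Lucene(14,20), the same as without stop words. 
Unlike StopAnalyzer, the sketch does not lowercase the emitted terms.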
