[ https://issues.apache.org/jira/browse/LUCENE-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi updated LUCENE-2668:
-----------------------------------
Attachment: LUCENE-2668.patch
Here is my idea. It is very simple: always add the offset gap. This causes a
failure in the offset test of TestIndexWriter.
Can anyone explain why the condition (if (anyToken)) is there?
> offset gap should be added regardless of existence of tokens in
> DocInverterPerField
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-2668
> URL: https://issues.apache.org/jira/browse/LUCENE-2668
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
> Reporter: Koji Sekiguchi
> Priority: Minor
> Attachments: LUCENE-2668.patch, Test.java
>
>
> Problem: If one value of a multiValued field consists only of a stop word
> (e.g. "will" in the following sample) and the field is analyzed by
> StopAnalyzer when indexing, the offsets of the subsequent tokens are not
> correct.
> {code:title=indexing a multiValued field}
> doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "will", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "use", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED,
>     TermVector.WITH_OFFSETS ) );
> {code}
> In this program (soon to be attached), if you use WhitespaceAnalyzer, the
> offsets (start,end) for "use" and "Lucene" will be use(10,13) and
> Lucene(14,20). But if you use StopAnalyzer, the offsets will be use(9,12) and
> lucene(13,19). Since the searcher cannot know which analyzer was used at
> indexing time, this problem puts FVH (FastVectorHighlighter) out of
> alignment when searching.
> Cause of the problem: StopAnalyzer filters out "will", so the anyToken flag
> is false and the offset gap is not added in DocInverterPerField:
> {code:title=DocInverterPerField.java}
> if (anyToken)
>   fieldState.offset += docState.analyzer.getOffsetGap(field);
> {code}
> I don't understand why the condition is there... If the gap is always added,
> I think things become simpler.
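
The offset arithmetic described in the report can be reproduced with a small
stand-alone model. This is only a sketch and not Lucene code: the class
OffsetGapSketch, the method offsets() and the alwaysAddGap switch are
hypothetical names, each field value is treated as a single token, the offset
gap is assumed to be 1 (which is consistent with the offsets quoted above), and
terms are not lowercased the way StopAnalyzer would do it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OffsetGapSketch {
    // Assumption: a gap of 1 between values, consistent with the reported offsets.
    static final int OFFSET_GAP = 1;

    /**
     * Returns "term(start,end)" entries for the values that are not stop words.
     * alwaysAddGap=false models the current "if (anyToken)" condition;
     * alwaysAddGap=true models the proposed change.
     */
    static List<String> offsets(String[] values, Set<String> stopWords, boolean alwaysAddGap) {
        List<String> out = new ArrayList<>();
        int base = 0; // plays the role of fieldState.offset
        for (String value : values) {
            boolean anyToken = false;
            if (!stopWords.contains(value.toLowerCase())) {
                out.add(value + "(" + base + "," + (base + value.length()) + ")");
                anyToken = true;
            }
            // The length of the value is added whether or not a token survived.
            base += value.length();
            if (alwaysAddGap || anyToken) {
                base += OFFSET_GAP;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {"Mike", "will", "use", "Lucene"};
        Set<String> noStops = Collections.emptySet();
        Set<String> stops = new HashSet<>(Arrays.asList("will"));

        System.out.println("no stop words:          " + offsets(values, noStops, false));
        System.out.println("stop word, conditional: " + offsets(values, stops, false));
        System.out.println("stop word, always gap:  " + offsets(values, stops, true));
    }
}
```

In this model, skipping the gap for the filtered value "will" shifts every
later offset down by one, and always adding the gap brings the offsets back in
line with the no-stop-word case.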