[
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734022#action_12734022
]
Michael Busch commented on LUCENE-1448:
---------------------------------------
OK, I think I have this basically working with both the old and the new API (including the LUCENE-1693 changes).
The approach I took is fairly simple and doesn't require adding a new Attribute. I added the following method to TokenStream:
{code:java}
/**
 * This method is called by the consumer after the last token has been consumed,
 * i.e. after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API)
 * or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API).
 * <p/>
 * This method can be used to perform any end-of-stream operations, such as setting the final
 * offset of a stream. The final offset of a stream might differ from the offset of the last token,
 * e.g. in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer}
 * was used.
 * <p/>
 *
 * @throws IOException
 */
public void end() throws IOException {
  // do nothing by default
}
{code}
Then I took Mike's patch and implemented end() in all classes where his patch added getFinalOffset().
E.g. in CharTokenizer the implementation looks like this:
{code:java}
public void end() {
  // set final offset
  int finalOffset = input.correctOffset(offset);
  offsetAtt.setOffset(finalOffset, finalOffset);
}
{code}
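Filters would presumably just delegate end() to their input stream so the final offset set by the underlying Tokenizer is preserved. A minimal sketch (not from the patch; the filter name is made up):
{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter: it adds no behavior of its own, it only shows that
// end() would be forwarded to the input stream so the final offset set by
// the underlying Tokenizer is not lost.
public class PassThroughFilter extends TokenFilter {

  public PassThroughFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    return input.incrementToken();
  }

  public void end() throws IOException {
    input.end(); // delegate so the tokenizer can set the final offset
  }
}
{code}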
I changed DocInverterPerField to call end() after the stream is fully consumed and to use the value that offsetAttribute.endOffset() returns as the final offset.
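To make the consumer side concrete, here is a rough sketch of the pattern (not the actual DocInverterPerField code; the input text is made up, and it assumes the end() method above):
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Consume the stream, then call end() and read the final offset.
TokenStream ts = new WhitespaceTokenizer(new StringReader("foo bar   "));
OffsetAttribute offsetAtt = (OffsetAttribute) ts.addAttribute(OffsetAttribute.class);
while (ts.incrementToken()) {
  // index the token ...
}
ts.end();
// The last token "bar" ends at offset 7, but the final offset reported here
// is 10 because of the three trailing whitespace characters.
int finalOffset = offsetAtt.endOffset();
ts.close();
{code}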
I also added all new tests from Mike's latest patch.
All unit tests, including the new ones, pass, and test-tag passes as well.
I'm not posting a patch yet, because this depends on LUCENE-1693.
Mike, Uwe, others: could you please review if this approach makes sense?
> add getFinalOffset() to TokenStream
> -----------------------------------
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Reporter: Michael McCandless
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch,
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a
> document, and you then index those fields with TermVectors storing offsets,
> it's very likely the offsets for all but the first field instance will be
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the
> offsets of each field instance, where that base is 1 + the endOffset of the
> last token it saw when analyzing that field.
> But this logic is overly simplistic. For example, if the WhitespaceAnalyzer
> is being used, and the text being analyzed ended in 3 whitespace characters,
> then that information is lost and the next field's offsets are all 3 too
> small. Similarly, if a StopFilter appears in the chain, and the last N
> tokens were stop words, then the base will be 1 + the endOffset of the last
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm
> thinking by default it returns -1, which means "I don't know so you figure it
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.
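To make the quoted scenario concrete, here is a hypothetical illustration (field name, text, and the writer variable are made up; it assumes WhitespaceAnalyzer and term vectors that store offsets):
{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Two values for the same field; the first ends in three whitespace characters.
Document doc = new Document();
doc.add(new Field("body", "foo bar   ", Field.Store.NO,
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("body", "baz", Field.Store.NO,
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

// With WhitespaceAnalyzer the base for the second value is derived from the
// end offset of "bar" (7) rather than from the true length of the first value
// (10), so the stored offsets for "baz" end up 3 too small.
writer.addDocument(doc); // "writer" is an assumed, already-configured IndexWriter
{code}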