[
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734022#action_12734022
]
Michael Busch commented on LUCENE-1448:
---------------------------------------
OK, I think I have this basically working with both the old and the new API (including the LUCENE-1693 changes).
The approach I took is fairly simple and doesn't require adding a new Attribute. I added the following method to TokenStream:
{code:java}
/**
 * This method is called by the consumer after the last token has been consumed,
 * i.e. after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API)
 * or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API).
 * <p/>
 * This method can be used to perform any end-of-stream operations, such as setting the final
 * offset of a stream. The final offset of a stream might differ from the offset of the last token,
 * e.g. in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer}
 * was used.
 * <p/>
 *
 * @throws IOException
 */
public void end() throws IOException {
  // do nothing by default
}
{code}
Then I took Mike's patch and implemented end() in all classes where his patch added getFinalOffset().
E.g. in CharTokenizer the implementation looks like this:
{code:java}
public void end() {
  // set final offset
  int finalOffset = input.correctOffset(offset);
  offsetAtt.setOffset(finalOffset, finalOffset);
}
{code}
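Filters would presumably just delegate end() to their input stream so the final offset set by the underlying Tokenizer is preserved. A minimal sketch (not from the patch; the filter name is made up):
{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter: it adds no behavior of its own, it only shows that
// end() would be forwarded to the input stream so the final offset set by
// the underlying Tokenizer is not lost.
public class PassThroughFilter extends TokenFilter {

  public PassThroughFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    return input.incrementToken();
  }

  public void end() throws IOException {
    input.end(); // delegate so the tokenizer can set the final offset
  }
}
{code}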
I changed DocInverterPerField to call end() after the stream is fully consumed and to use the value that offsetAttribute.endOffset() returns as the final offset.
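To make the consumer side concrete, here is a rough sketch of the pattern (not the actual DocInverterPerField code; the input text is made up, and it assumes the end() method above):
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Consume the stream, then call end() and read the final offset.
TokenStream ts = new WhitespaceTokenizer(new StringReader("foo bar   "));
OffsetAttribute offsetAtt = (OffsetAttribute) ts.addAttribute(OffsetAttribute.class);
while (ts.incrementToken()) {
  // index the token ...
}
ts.end();
// The last token "bar" ends at offset 7, but the final offset reported here
// is 10 because of the three trailing whitespace characters.
int finalOffset = offsetAtt.endOffset();
ts.close();
{code}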
I also added all new tests from Mike's latest patch.
All unit tests, including the new ones, pass, and test-tag passes as well.
I'm not posting a patch yet, because this depends on LUCENE-1693.
Mike, Uwe, others: could you please review if this approach makes sense?
> add getFinalOffset() to TokenStream
> -----------------------------------
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Reporter: Michael McCandless
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch,
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a
> document, and you then index those fields with TermVectors storing offsets,
> it's very likely the offsets for all but the first field instance will be
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the
> offsets of each field instance, where that base is 1 + the endOffset of the
> last token it saw when analyzing that field.
> But this logic is overly simplistic. For example, if the WhitespaceAnalyzer
> is being used, and the text being analyzed ended in 3 whitespace characters,
> then that information is lost and the next field's offsets are all 3 too
> small. Similarly, if a StopFilter appears in the chain, and the last N
> tokens were stop words, then the base will be 1 + the endOffset of the last
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm
> thinking by default it returns -1, which means "I don't know so you figure it
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.
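To make the quoted scenario concrete, here is a hypothetical illustration (field name, text, and the writer variable are made up; it assumes WhitespaceAnalyzer and term vectors that store offsets):
{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Two values for the same field; the first ends in three whitespace characters.
Document doc = new Document();
doc.add(new Field("body", "foo bar   ", Field.Store.NO,
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("body", "baz", Field.Store.NO,
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

// With WhitespaceAnalyzer the base for the second value is derived from the
// end offset of "bar" (7) rather than from the true length of the first value
// (10), so the stored offsets for "baz" end up 3 too small.
writer.addDocument(doc); // "writer" is an assumed, already-configured IndexWriter
{code}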