[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648290#action_12648290 ]
Michael Busch commented on LUCENE-1448:
---------------------------------------

{quote}
Hmm we could do that... but it seems awkward to add new attributes that apply only to ending state of the tokenizer.
{quote}

Yeah. Also you wouldn't want to pay overhead in TokenFilters that can buffer tokens to serialize or clone those attributes for every token.

{quote}
I wonder if instead, w/ the new API, we could simply allow querying of certain attributes (offset, posincr) after incrementToken returns "false"?
{quote}

Yeah, maybe we can make the AttributeSource more sophisticated, so that it can distinguish between per-field (instance) and per-token attributes. But as a separate patch, not as part of LUCENE-1422.

{quote}
Why don't you commit the new TokenStream API first, and we can iterate on this issue & commit 2nd?
{quote}

OK, will do. I think 1422 is ready now.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a
> document, and you then index those fields with TermVectors storing offsets,
> it's very likely the offsets for all but the first field instance will be
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the
> offsets of each field instance, where that base is 1 + the endOffset of the
> last token it saw when analyzing that field.
> But this logic is overly simplistic. For example, if the WhitespaceAnalyzer
> is being used, and the text being analyzed ended in 3 whitespace characters,
> then that information is lost and the next field's offsets are then all 3
> too small.
> Similarly, if a StopFilter appears in the chain, and the last N
> tokens were stop words, then the base will be 1 + the endOffset of the last
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm
> thinking by default it returns -1, which means "I don't know, so you figure it
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
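To make the bug concrete, here is a minimal, self-contained sketch of the arithmetic the description walks through. The `SimpleTokenizer` class and its method names are hypothetical stand-ins, not the real Lucene classes: it models a whitespace tokenizer whose input ends in 3 trailing spaces, the faulty base (1 + endOffset of the last token), and what a `getFinalOffset()` covering the full analyzed text would report instead.

```java
// Hypothetical model of the offset-base problem; NOT the real Lucene API.
final class SimpleTokenizer {
    private final String text;

    SimpleTokenizer(String text) { this.text = text; }

    // End offset of the last whitespace-delimited token, or 0 if there is none.
    // This is all IndexWriter can see today: trailing whitespace (or trailing
    // stop words removed by a StopFilter) leaves no trace in the last token.
    int lastTokenEndOffset() {
        int end = 0;
        int i = 0;
        while (i < text.length()) {
            if (!Character.isWhitespace(text.charAt(i))) {
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
                end = i;  // end offset of the token we just scanned past
            } else {
                i++;
            }
        }
        return end;
    }

    // Sketch of the proposed getFinalOffset(): the true end of the analyzed
    // text, including trailing characters that produced no token.
    int getFinalOffset() { return text.length(); }
}
```

For the input `"abc   "` (three trailing spaces), `lastTokenEndOffset()` is 3, so today's cumulative base for the next field instance is 1 + 3 = 4; with `getFinalOffset()` returning 6, the base would be 1 + 6 = 7, and the second instance's offsets would no longer be 3 too small.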