[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Michael McCandless (JIRA) Tue, 11 Nov 2008 11:44:39 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-1448:
---------------------------------------

    Attachment: LUCENE-1448.patch

Attached new patch (changes described below):

bq. You need that +1 or you will have the subsequent token starting on the tail 
of the 'stopword'. 

So logically it's like we silently & forcefully insert a space between
the Fieldable instances?

Maybe we should add Analyzer.getOffsetGap(String fieldName), which by
default would return 1, and we then add that into the offset for
subsequent field instances?

But then here's another challenge: for NOT_ANALYZED fields we don't
add this extra +1.  We just add the string length.  Hmm.

OK I added Analyzer.getOffsetGap(Fieldable), and defaulted it to
return 1 for analyzed fields and 0 for unanalyzed fields.
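
A minimal sketch of that default and of how the gap could feed the
cumulative base. The class and field names below are hypothetical
stand-ins, not the Lucene API and not the code in the attached patch;
it also assumes the gap is taken from the instance that was just
indexed, which the comment above does not spell out.

```java
import java.util.List;

// Hypothetical stand-ins for illustration only; not Lucene classes.
public class OffsetGapSketch {

    /** Simplified field instance: finalOffset is what its stream reports after analysis. */
    static final class FieldInstance {
        final boolean analyzed;
        final int finalOffset;

        FieldInstance(boolean analyzed, int finalOffset) {
            this.analyzed = analyzed;
            this.finalOffset = finalOffset;
        }
    }

    /** Default described above: gap of 1 for analyzed fields, 0 for unanalyzed ones. */
    static int getOffsetGap(FieldInstance field) {
        return field.analyzed ? 1 : 0;
    }

    /** Base added to the offsets of whichever instance follows the given ones (assumed rule). */
    static int baseAfter(List<FieldInstance> instances) {
        int base = 0;
        for (FieldInstance f : instances) {
            base += f.finalOffset + getOffsetGap(f);
        }
        return base;
    }
}
```

With an analyzed instance of final offset 5 followed by an unanalyzed
one of length 3, the next instance would start at base (5 + 1) + (3 + 0).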

bq. What's wrong with public int getFinalOffset() { return scanner.yychar() + 
scanner.yylength(); }

Does that handle spaces at the end of the text?  (Oh it seems like it
does...I added a test case...hmm).

bq. i didnt correctly put the SA piece in the jflex file

I think this change (adding getFinalOffset to StandardTokenizer)
doesn't need a change to jflex?  (It's only if you edit
StandardTokenizerImpl.java).

Hmm, another complexity is handling a field instance that produces no
tokens.  Currently, we do not increment the cumulative offset by +1 in
such cases.  But for the position increment gap, we always add the gap
between fields if any earlier field instance has produced a token.  I
added a couple of test cases for this.

Also, I fixed a bug in how CharTokenizer was computing its final
offset.

Still to do:
  - add test cases to cover NOT_ANALYZED fields
  - fix contrib tokenizers to implement getFinalOffset


> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the users list.
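
The trailing-whitespace case in the description can be worked through
with a small sketch. This is an illustration under assumptions, not
IndexWriter's code: `FinalOffsetSketch` and its methods are hypothetical
names, tokens are taken to be whitespace-delimited, and the "true" final
offset is assumed to be the length of the text.

```java
// Hypothetical illustration of the faulty base versus a base built from the final offset.
public class FinalOffsetSketch {

    /** End offset of the last whitespace-delimited token, or 0 if the text has no token. */
    static int lastTokenEndOffset(String text) {
        int end = text.length();
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return end;
    }

    /** Base as described in the issue: 1 + endOffset of the last token seen. */
    static int faultyBase(String text) {
        return 1 + lastTokenEndOffset(text);
    }

    /** Base if the stream reported its final offset (assumed here to be the text length). */
    static int baseFromFinalOffset(String text) {
        return 1 + text.length();
    }
}
```

For the text "foo bar" followed by 3 spaces, the last token ends at
offset 7, so the faulty base is 8, while the base built from the final
offset is 11: the next instance's offsets come out 3 too small.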

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
