[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653893#action_12653893 ]
Michael McCandless commented on LUCENE-1448:
--------------------------------------------

{quote}
What I'd like to work on soon is an efficient way to buffer attributes (maybe add methods to Attribute that write into a ByteBuffer). Then attributes can decide which variables need to be serialized and which ones don't. In that case we could add a finalOffset to OffsetAttribute that does not get serialized/deserialized.
{quote}

I like that (it'd make streams like CachingTokenFilter much more efficient). It'd also presumably lead to more efficiently serialized token streams. But you'd still need a way in this model to serialize finalOffset, once, at the end?

{quote}
And possibly it might be worthwhile to have explicit states defined in a TokenStream that we can enforce with three methods: start(), increment(), end(). Then people would know that if they have to do something at the end of a stream, they have to do it in end().
{quote}

This also seems good. So end() would be the obvious place to set OffsetAttribute.finalOffset, PositionIncrementAttribute.positionIncrementGap, etc.

OK, I'm gonna assign this one to you, Michael ;)

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
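To make the proposed lifecycle concrete, here is a self-contained toy sketch of the start()/increment()/end() idea discussed above: callers pull tokens until the stream is exhausted, then call end() to read state that is only known once the whole input has been consumed, such as the final offset. The class and method names below are simplified illustrations, not the real Lucene API.

```java
// Toy whitespace tokenizer illustrating the proposed end() hook.
// Not the actual Lucene TokenStream/OffsetAttribute classes.
class ToyWhitespaceTokenizer {
    private final String input;
    private int pos = 0;
    private int startOffset, endOffset;   // offsets of the current token
    private int finalOffset = -1;         // set by end(); -1 means "not yet known"

    ToyWhitespaceTokenizer(String input) { this.input = input; }

    /** Advance to the next token; returns false when the stream is exhausted. */
    boolean incrementToken() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos == input.length()) return false;
        startOffset = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) pos++;
        endOffset = pos;
        return true;
    }

    /** Called once, after the last incrementToken(); records the true end of the text. */
    void end() { finalOffset = input.length(); }

    int endOffset()   { return endOffset; }
    int finalOffset() { return finalOffset; }

    public static void main(String[] args) {
        ToyWhitespaceTokenizer t = new ToyWhitespaceTokenizer("abc def   ");
        while (t.incrementToken()) { /* consume all tokens */ }
        t.end();
        // Last token "def" ends at 7, but the text really ends at 10:
        System.out.println("last endOffset=" + t.endOffset()
                + " finalOffset=" + t.finalOffset());
    }
}
```

For input "abc def   " (three trailing spaces), the last token's endOffset is 7 while the final offset is 10; only end() can see that difference, which is exactly the information the current API loses.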
> But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and the next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means "I don't know, so you figure it out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the users list.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
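The offset drift described in the issue can be shown with plain arithmetic, without touching IndexWriter internals. The sketch below uses the whitespace example from the text (a field value ending in 3 spaces); the numbers and helper names are illustrative only.

```java
// Demonstrates why a base of "1 + last token's endOffset" undercounts
// when trailing characters produce no token. Illustrative sketch, not
// IndexWriter's actual code.
public class OffsetBaseDemo {
    /** Base IndexWriter applies today, per the issue: 1 + last token's endOffset. */
    static int naiveBase(int lastTokenEndOffset) { return 1 + lastTokenEndOffset; }

    /** Base a getFinalOffset() would allow: 1 + the text's true end. */
    static int fixedBase(int finalOffset) { return 1 + finalOffset; }

    public static void main(String[] args) {
        String first = "abc def   ";      // first field value, ends in 3 spaces
        int lastTokenEnd = 7;             // endOffset of "def" under whitespace tokenization
        int finalOffset = first.length(); // 10: includes the trailing whitespace

        // Every offset in the *next* field instance is shifted by the base,
        // so the naive rule makes them all 3 too small, as the issue states.
        System.out.println("naive base = " + naiveBase(lastTokenEnd));
        System.out.println("fixed base = " + fixedBase(finalOffset));
    }
}
```

Here the naive base is 8 and the corrected base is 11; the 3-character gap is exactly the trailing whitespace the tokenizer discarded, matching the "all 3 too small" failure in the report.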