[ https://issues.apache.org/jira/browse/LUCENE-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3849:
--------------------------------
    Attachment: LUCENE-3849.patch

Here's a patch. There are some things I don't like about it, though.

Again, to explain the situation:
* It's buggy today: we call end() and then read a dirty offsetAttribute. Nothing clears the atts before end().
* We need to support multiple removers in the chain, each applying its posInc just like it does in incrementToken().
* Because of that, the atts cannot be dirty.

So in this patch I just call clearAttributes() in TokenStream.end() by default instead of doing nothing. This works, except that when IW consumes this 'final posinc' there is an off-by-one (OB1), because posIncAtt's default value is 1. I don't like that.

Alternatively, we could have TokenStream explicitly set posIncrAtt to 0 in end() instead of calling clearAttributes()? I'm just wondering if that's really any better.

Otherwise the patch is straightforward, with the exception of IW's built-in keyword tokenizer (StringField.java). That one is not actually setting end(); we were just relying upon dirty atts, so that's why I changed it.

> position increments should be implemented by TokenStream.end()
> --------------------------------------------------------------
>
>                 Key: LUCENE-3849
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3849
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Robert Muir
>         Attachments: LUCENE-3849.patch
>
>
> If you have pages of a book as multivalued fields, with the default position increment gap
> of Analyzer.java (0), phrase queries won't work across pages if one ends with stopword(s).
> This is because the 'trailing holes' are not taken into account in end(). So I think in
> TokenStream.end(), subclasses of FilteringTokenFilter (e.g. StopFilter) should do:
> {code}
> super.end();
> posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
> {code}
> One problem is that these filters need to 'add' to the posinc, but currently nothing clears
> the attributes for end() [they are dirty, except offset, which is set by the tokenizer].
> Also, the indexer should be changed to pull posIncAtt from end().
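
To make the position arithmetic above concrete, here is a minimal, self-contained sketch. It is plain Java and does not use any Lucene classes; `TrailingHolesSketch`, `EndMode`, the stop word set and the two sample "pages" are all hypothetical names and data made up for illustration, and the way the consumer adds the final increment is an assumption about how IW would behave, not code from the patch. It models a position increment gap of 0 and compares three options: ignoring end(), starting end() from the cleared default of 1, and starting end() from 0.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone model of the position bookkeeping; not Lucene code.
final class TrailingHolesSketch {

    // How the consumer treats the position increment reported at end().
    enum EndMode {
        IGNORED,          // trailing holes are dropped (the reported bug)
        CLEAR_ATTRIBUTES, // end() starts from the default increment of 1
        RESET_TO_ZERO     // end() starts from an explicit increment of 0
    }

    // Assumed stop word list, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("the", "of", "a");

    // Returns the position assigned to each kept term across all field values.
    static Map<String, Integer> index(List<String> values, EndMode mode) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int position = -1; // first kept token with increment 1 lands on 0
        for (String value : values) {
            int skipped = 0; // holes left by removed tokens since the last kept token
            for (String term : value.split(" ")) {
                if (STOP_WORDS.contains(term)) {
                    skipped++;
                    continue;
                }
                position += 1 + skipped; // increment 1 plus pending holes
                skipped = 0;
                positions.put(term, position);
            }
            // Model of end(): the filter adds its trailing holes to a base value.
            switch (mode) {
                case IGNORED:
                    break;
                case CLEAR_ATTRIBUTES:
                    position += 1 + skipped;
                    break;
                case RESET_TO_ZERO:
                    position += skipped;
                    break;
            }
            // Position increment gap between values is 0, so nothing more is added.
        }
        return positions;
    }

    // Distance between "fox" (end of page 1) and "brown" (start of page 2).
    static int distance(EndMode mode) {
        Map<String, Integer> p = index(List.of("the quick fox of", "brown dog"), mode);
        return p.get("brown") - p.get("fox");
    }

    public static void main(String[] args) {
        for (EndMode mode : EndMode.values()) {
            System.out.println(mode + ": " + distance(mode));
        }
    }
}
```

In this model, a phrase query for "fox of brown" analyzed with the same stop words expects "brown" two positions after "fox". Only `RESET_TO_ZERO` yields a distance of 2; `IGNORED` gives 1 (the trailing hole is lost) and `CLEAR_ATTRIBUTES` gives 3 (the off-by-one from the default increment of 1).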