uschindler edited a comment on pull request #1966: URL: https://github.com/apache/lucene-solr/pull/1966#issuecomment-705832388
Hi maybe let me explain it: With the filter how you implemented it originally, you have the problem of the first token (that has a synonym) leading to a 0 position increment on the second token. If all token filters dropping tokens would be implemented like yours (just dropping away tkens), also StopFilter removing a stop word with synonym would lead to the fact. We also have many other token filters like that. So my suggestion here is the following: Please use the abstract implementation for those dropping filters, called `FilteringTokenFilter`. Although you think you need to look forward/backwards (which is indeed not possible), the correct way to handle this is to change the PositionIncrementAttribute on the token following the dropped tokens. If you remove a token with no increment (the second synonym), nothing needs to be done (increment is 0). If you remove the first token (which has increment 1 or maybe 2), the increment needs to be moved to the following token. So if the very first token of the very first synonmy needs to be removed, the position increment (1) needs to be added to the second token. If the second token is a synonym, then it gets "upgraded" to first token (0 + 1 = 1). If the second token is a standard token, then it get increment (1 + 1 = 2). So the tokenstream knows that theres a gap at the first position. FilteringTokenFilter handles this by taking care of all positions and fixing those of following tokens. Read the source code, it just sums up increments of dropped tokens and adds them to the next token that is kept. This is thouroughly tested, and whole of Lucene's Graph APIs and span queries rely on that. The proposed TokenFilter here is nothing special, it just behaves like any other StopFilter-like filter. The bug you mention was fixed in StopFilter about 10 years ago (this was one of my first commits with @rmuir back at that time). So please redo this PR: - Copypaste the TokenFilter as posted above (very easy) - Fix your tests to not expect a "dummy token". If you have a problem with your other filter, fix that one, the problem is not in this filter if it correctly implements FilteringTokenFilter. I read your comment above, the workaround you propose is still not correct. The tokenfilter graph must be correct by fixing positions, not adding dummy tokens. SpanQueries work correctly with FilteringTokenFilter, whcih not only fixes the very first token (like yours), but also gaps coming later. Your token filter makes real gaps disappear. If for some reason, you don't want to have real gaps (increment > 1) in your stream, another tokenfilter making gaps sequential would be needed. So, if you want a different behaviour regarding gaps, open a separate issue, as a change around that needs to be done in FilteringTokenFilter, not a subclass or a reimplemented filter. Sorry for being aggressive, but - sorry - the code you posted had really bad problems and was not accoring to any coding standards. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org