uschindler commented on pull request #1966: URL: https://github.com/apache/lucene-solr/pull/1966#issuecomment-705832388
Hi, let me explain it: With the filter as you implemented it originally, dropping the first token (the one that has a synonym) leads to a 0 position increment on the second token. If every token filter that drops tokens were implemented like yours (just throwing tokens away), then a StopFilter removing a stop word that has a synonym would have the same problem. We have many other token filters like that.

So my suggestion is the following: please use the abstract implementation for such dropping filters, called `FilteringTokenFilter`. Although you think you need to look forward/backwards (which is indeed not possible), the correct way to handle this is to change the PositionIncrementAttribute of the token following the dropped ones. If you remove a token with increment 0 (the second synonym), nothing needs to be done. If you remove the first token (which has increment 1 or maybe 2), its increment needs to be moved to the following token. So if the very first token of the very first synonym is removed, its position increment (1) is added to the second token: if the second token is a synonym, it gets "upgraded" to first token (0 + 1 = 1); if the second token is a standard token, it gets increment 1 + 1 = 2, so the token stream knows there is a gap at the first position.

FilteringTokenFilter handles this by keeping track of all positions and fixing those of the following tokens. Read the source code: it simply sums up the increments of dropped tokens and adds them to the next token that is kept (see the sketch at the end of this comment). This is thoroughly tested. The proposed TokenFilter here is nothing special, it just behaves like any other StopFilter-like filter. The bug you mention was fixed in StopFilter about 10 years ago (this was one of my first commits with @rmuir back at that time).

So please redo this PR:
- Copy-paste the TokenFilter as posted above (very easy)
- Fix your tests to not expect a "dummy token". If you have a problem with your other filter, fix that one.

I read your comment above; the discussion is still not correct. The token filter graph must be correct. SpanQueries work correctly with FilteringTokenFilter. If you want a different behaviour, open a separate issue, as a change around that needs to be done in FilteringTokenFilter, not in a subclass or a reimplemented filter.

Sorry for being aggressive, but - sorry - the code you posted had really bad problems and did not follow any coding standards.
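For reference, a minimal sketch of what such a dropping filter looks like when built on `FilteringTokenFilter`. This is not the code posted above; the class name `DropFlaggedTokensFilter` and the flag-based accept criterion are assumptions for illustration only. The point is that the subclass only implements `accept()`, while `FilteringTokenFilter` accumulates the position increments of dropped tokens and adds them to the next kept token:

```java
import java.io.IOException;

import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;

/**
 * Hypothetical example: drops every token whose flags match the given bits.
 * FilteringTokenFilter calls accept() once per token; for each dropped token
 * it adds that token's position increment to the next token that is kept,
 * so the position graph stays correct without any look-ahead or look-behind.
 */
public final class DropFlaggedTokensFilter extends FilteringTokenFilter {
  private final int dropFlags;
  private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);

  public DropFlaggedTokensFilter(TokenStream in, int dropFlags) {
    super(in);
    this.dropFlags = dropFlags;
  }

  @Override
  protected boolean accept() throws IOException {
    // keep the token only if none of the configured flag bits are set
    return (flagsAtt.getFlags() & dropFlags) == 0;
  }
}
```

No other increment bookkeeping is needed in the subclass; that is exactly what StopFilter and similar filters rely on.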