uschindler commented on pull request #1966:
URL: https://github.com/apache/lucene-solr/pull/1966#issuecomment-705832388


   Hi, let me explain it:
   With the filter as you implemented it originally, you have the problem that
dropping the first token (the one that has a synonym) leaves a position
increment of 0 on the second token. If every token filter that drops tokens
were implemented like yours (just throwing tokens away), a StopFilter removing
a stop word that has a synonym would show the same problem. We also have many
other token filters like that.
   
   So my suggestion here is the following: please use the abstract base class
for such dropping filters, called `FilteringTokenFilter`. Although you think
you need to look forwards/backwards (which is indeed not possible), the correct
way to handle this is to change the `PositionIncrementAttribute` on the token
following the dropped tokens. If you remove a token with no increment (the
second synonym), nothing needs to be done (the increment is 0). If you remove
the first token (which has increment 1, or maybe 2), its increment needs to be
moved to the following token. So if the very first token of the very first
synonym needs to be removed, its position increment (1) needs to be added to
the second token. If the second token is a synonym, it gets "upgraded" to a
first token (0 + 1 = 1). If the second token is a standard token, it gets
increment (1 + 1 = 2). So the token stream knows that there's a gap at the
first position.
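
   For illustration, a minimal sketch of such a dropping filter built on
`FilteringTokenFilter` could look like the following (the class name
`DropShortTokensFilter` and the length-based `accept()` condition are invented
for this example; plug in your own predicate):

   ```java
   import java.io.IOException;

   import org.apache.lucene.analysis.FilteringTokenFilter;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

   /** Example filter that drops tokens shorter than a minimum length. */
   public final class DropShortTokensFilter extends FilteringTokenFilter {
     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
     private final int minLength; // hypothetical parameter, for this sketch only

     public DropShortTokensFilter(TokenStream in, int minLength) {
       super(in);
       this.minLength = minLength;
     }

     @Override
     protected boolean accept() throws IOException {
       // returning false drops the current token; FilteringTokenFilter then
       // adds its position increment to the next token that is kept
       return termAtt.length() >= minLength;
     }
   }
   ```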
   
   `FilteringTokenFilter` handles this by taking care of all positions and
fixing those of the following tokens. Read the source code: it just sums up the
increments of dropped tokens and adds them to the next token that is kept. This
is thoroughly tested.
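
   Simplified, the bookkeeping inside `FilteringTokenFilter.incrementToken()`
works roughly like this (a sketch of the idea, not a verbatim copy of the
source; `posIncrAtt` is the stream's `PositionIncrementAttribute`):

   ```java
   @Override
   public final boolean incrementToken() throws IOException {
     int skippedPositions = 0;
     while (input.incrementToken()) {
       if (accept()) {
         if (skippedPositions != 0) {
           // move the summed increments of all dropped tokens onto the kept token
           posIncrAtt.setPositionIncrement(
               posIncrAtt.getPositionIncrement() + skippedPositions);
         }
         return true;
       }
       // token is dropped: remember its increment for the next kept token
       skippedPositions += posIncrAtt.getPositionIncrement();
     }
     return false;
   }
   ```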
   
   The proposed TokenFilter here is nothing special; it just behaves like any
other StopFilter-like filter. The bug you mention was fixed in StopFilter about
10 years ago (this was one of my first commits with @rmuir back at that time).
   
   So please redo this PR:
   - Copy and paste the TokenFilter as posted above (very easy)
   - Fix your tests to not expect a "dummy token" (see the test sketch below).
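
   A hedged sketch of what such a test could look like (it uses the hypothetical
`DropShortTokensFilter` from above; terms and expected increments are invented):
the dropped token's increment moves to the next kept token, and no placeholder
token shows up in the output.

   ```java
   import java.io.StringReader;

   import org.apache.lucene.analysis.BaseTokenStreamTestCase;
   import org.apache.lucene.analysis.MockTokenizer;
   import org.apache.lucene.analysis.Tokenizer;

   public class TestDropShortTokensFilter extends BaseTokenStreamTestCase {
     public void testDroppedTokenMovesIncrement() throws Exception {
       Tokenizer tok = new MockTokenizer(MockTokenizer.WHITESPACE, false);
       tok.setReader(new StringReader("a longer stream"));
       // "a" (increment 1) is dropped, so "longer" must carry 1 + 1 = 2
       assertTokenStreamContents(new DropShortTokensFilter(tok, 2),
           new String[] {"longer", "stream"},
           new int[] {2, 1});
     }
   }
   ```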
   
   If you have a problem with your other filter, fix that one.
   
   I read your comment above; the reasoning there is still not correct. The
token graph produced by the filter must be correct, and SpanQueries work
correctly with FilteringTokenFilter.
   
   If you want a different behaviour, open a separate issue, because such a
change would need to be made in FilteringTokenFilter itself, not in a subclass
or a reimplemented filter.
   
   Sorry for being aggressive, but the code you posted had really bad problems
and did not follow any coding standards.


