uschindler edited a comment on pull request #1966:
URL: https://github.com/apache/lucene-solr/pull/1966#issuecomment-705832388


   Hi maybe let me explain it:
   With the filter how you implemented it originally, you have the problem of 
the first token (that has a synonym) leading to a 0 position increment on the 
second token. If all token filters dropping tokens would be implemented like 
yours (just dropping away tkens), also StopFilter removing a stop word with 
synonym would lead to the fact. We also have many other token filters like that.
   
   So my suggestion here is the following: Please use the abstract 
implementation for those dropping filters, called `FilteringTokenFilter`. 
Although you think you need to look forward/backwards (which is indeed not 
possible), the correct way to handle this is to change the 
PositionIncrementAttribute on the token following the dropped tokens. If you 
remove a token with no increment (the second synonym), nothing needs to be done 
(increment is 0). If you remove the first token (which has increment 1 or maybe 
2), the increment needs to be moved to the following token. So if the very 
first token of the very first synonmy needs to be removed, the position 
increment (1) needs to be added to the second token. If the second token is a 
synonym, then it gets "upgraded" to first token (0 + 1 = 1). If the second 
token is a standard token, then it get increment (1 + 1 = 2). So the 
tokenstream knows that theres a gap at the first position.
   
   FilteringTokenFilter handles this by taking care of all positions and fixing 
those of following tokens. Read the source code, it just sums up increments of 
dropped tokens and adds them to the next token that is kept. This is 
thouroughly tested, and whole of Lucene's Graph APIs and span queries rely on 
that.
   
   The proposed TokenFilter here is nothing special, it just behaves like any 
other StopFilter-like filter. The bug you mention was fixed in StopFilter about 
10 years ago (this was one of my first commits with @rmuir back at that time).
   
   So please redo this PR:
   - Copypaste the TokenFilter as posted above (very easy)
   - Fix your tests to not expect a "dummy token".
   
   If you have a problem with your other filter, fix that one, the problem is 
not in this filter if it correctly implements FilteringTokenFilter.
   
   I read your comment above, the workaround you propose is still not correct. 
The tokenfilter graph must be correct by fixing positions, not adding dummy 
tokens. SpanQueries work correctly with FilteringTokenFilter, whcih not only 
fixes the very first token (like yours), but also gaps coming later. Your token 
filter makes real gaps disappear.
   
   If for some reason, you don't want to have real gaps (increment > 1) in your 
stream, another tokenfilter making gaps sequential would be needed. So, if you 
want a different behaviour regarding gaps, open a separate issue, as a change 
around that needs to be done in FilteringTokenFilter, not a subclass or a 
reimplemented filter.
   
   Sorry for being aggressive, but - sorry - the code you posted had really bad 
problems and was not accoring to any coding standards.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to