gus-asf edited a comment on pull request #1966:
URL: https://github.com/apache/lucene-solr/pull/1966#issuecomment-705761838


   HI @uschindler , your comments sound like you think I will be unwilling to 
take feedback. I'm not sure why you feel this way and seem to need to say the 
same thing repeatedly. This is broken out from the original contribution ticket 
for the very purpose of getting feedback. That said I do have a concern with 
the solution that FilteringTokenFilter provides... Consider this case:
   
   **Text**: "January 401(k) contribution"
   **Whitespace tok**: "January"(pi:1) "401(k)"(pi:1) "contribution"(pi:1)
   **PatternTypingFilter**: "January"(pi:1), "401(k)"(pi:1, flag:2, 
type:legal2_401_k), "contribution"(pi:1)
   **TokenAnalyzerFilter**: "january"(pi:1),"401"(pi:1, flag:2, 
type:legal2_401_k),"k"(pi:1, flag:2, type:legal2_401_k)",  "contribution"(pi:1) 
   
   _(for the record uwe's next 2 comments were unaware of the remainder of this 
comment because I had made an inadvertent submission of the comment at this 
point and he chose to respond before I finished, possibly because it was late 
evening where he is and he didn't want to stay up to wait for me to finish 
which would be understandable)_
   
   Note that TokenAnalyzerFilter is a new class that is and must always be a 
solr concept, (see the solr ticket for that) because it runs an analyzer 
defined in the SolrSchema against the tokens and emits 1 to N replacement 
tokens, (and yes I have adjusted positions, and mapped flags and types etc 
accordingly :) very happy to have feedback if I missed something there). In the 
ccase above I've presumed it includes a standard tokenizer and lowercase filter 
factory. Also See the description of the PatternTypingFilter ticket as well for 
more background on the motivation.
   
   **TypeAsSynonymFilter**: "january"(pi:1),"401"(pi:1, flag:2, 
type:legal2_401_k),"legal2_401_k"(pi:0) "k"(pi:1, flag:2, type:legal2_401_k)", 
"legal2_401_k"(pi:0), "contribution"(pi:1)
   
   Then:
   **DropIfFlaggedFilteringFilterVersion**:   
"january"(pi:1),"legal2_401_k"(pi:1), "legal2_401_k"(pi:1), "contribution"(pi:1)
   **DropIfFlaggedAsWritten**:   "january"(pi:1),"legal2_401_k"(pi:0), 
"legal2_401_k"(pi:0), "contribution"(pi:1)
   
   Either case causes an inaccuracy, The AQP will be adding a convenient user 
friendly syntax for span queries, and the user feedback generally ran in the 
direction of complaining when such queries missed things that were clearly 
within range and rarely complained if something one too far away came back. I 
agree that the case with the first token is handled by the 
FilteringTokenFilter, but it creates a silent problem in the use case that is 
important to the users. So what I've tried to avoid is a bad interaction with 
other expected configuration.
   
   If I have the synonym tokens gain a flag (further enhancement to 
TypeAsSynonymFilter) and create something to remove sequential tokens with the 
same flag then the FilteringTokenFilter solution probably would get to the 
proper position increments, but this additional functionality was not 
achievable within the original timeline. One possibility is that I go back to 
them and (as I told them might happen) indicate that changes will be necessary 
for community acceptance. They are aware of this possibility.
   
   I'll be the first to admit I've spent more time in the Solr areas of the 
project than building Lucene Token Filters so I am of course happy to have 
feedback from folks like you who have been deep into Lucene for many years. I 
certainly wish to improve how I use all aspects of the project including 
Lucene. Learning is never finished. My large comments in the code are there to 
draw attention to what I already knew was an iffy solution. That said I will 
note that for all test cases at the LOC (the solr submission has over 100 unit 
tests and the LOC's original dev branch has another 50 or so that dealt with 
their specific configurations) what I've built does work to their satisfaction, 
so let's try to be cordial and make things better. Improvements and review by 
others is one of the major benefits of community process.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to