gus-asf edited a comment on pull request #1966: URL: https://github.com/apache/lucene-solr/pull/1966#issuecomment-705761838
HI @uschindler , your comments sound like you think I will be unwilling to take feedback. I'm not sure why you feel this way and seem to need to say the same thing repeatedly. This is broken out from the original contribution ticket for the very purpose of getting feedback. That said I do have a concern with the solution that FilteringTokenFilter provides... Consider this case: **Text**: "January 401(k) contribution" **Whitespace tok**: "January"(pi:1) "401(k)"(pi:1) "contribution"(pi:1) **PatternTypingFilter**: "January"(pi:1), "401(k)"(pi:1, flag:2, type:legal2_401_k), "contribution"(pi:1) **TokenAnalyzerFilter**: "january"(pi:1),"401"(pi:1, flag:2, type:legal2_401_k),"k"(pi:1, flag:2, type:legal2_401_k)", "contribution"(pi:1) _(for the record uwe's next 2 comments were unaware of the remainder of this comment because I had made an inadvertent submission of the comment at this point and he chose to respond before I finished, possibly because it was late evening where he is and he didn't want to stay up to wait for me to finish which would be understandable)_ Note that TokenAnalyzerFilter is a new class that is and must always be a solr concept, (see the solr ticket for that) because it runs an analyzer defined in the SolrSchema against the tokens and emits 1 to N replacement tokens, (and yes I have adjusted positions, and mapped flags and types etc accordingly :) very happy to have feedback if I missed something there). In the ccase above I've presumed it includes a standard tokenizer and lowercase filter factory. Also See the description of the PatternTypingFilter ticket as well for more background on the motivation. **TypeAsSynonymFilter**: "january"(pi:1),"401"(pi:1, flag:2, type:legal2_401_k),"legal2_401_k"(pi:0) "k"(pi:1, flag:2, type:legal2_401_k)", "legal2_401_k"(pi:0), "contribution"(pi:1) Then: **DropIfFlaggedFilteringFilterVersion**: "january"(pi:1),"legal2_401_k"(pi:1), "legal2_401_k"(pi:1), "contribution"(pi:1) **DropIfFlaggedAsWritten**: "january"(pi:1),"legal2_401_k"(pi:0), "legal2_401_k"(pi:0), "contribution"(pi:1) Either case causes an inaccuracy, The AQP will be adding a convenient user friendly syntax for span queries, and the user feedback generally ran in the direction of complaining when such queries missed things that were clearly within range and rarely complained if something one too far away came back. I agree that the case with the first token is handled by the FilteringTokenFilter, but it creates a silent problem in the use case that is important to the users. So what I've tried to avoid is a bad interaction with other expected configuration. If I have the synonym tokens gain a flag (further enhancement to TypeAsSynonymFilter) and create something to remove sequential tokens with the same flag then the FilteringTokenFilter solution probably would get to the proper position increments, but this additional functionality was not achievable within the original timeline. One possibility is that I go back to them and (as I told them might happen) indicate that changes will be necessary for community acceptance. They are aware of this possibility. I'll be the first to admit I've spent more time in the Solr areas of the project than building Lucene Token Filters so I am of course happy to have feedback from folks like you who have been deep into Lucene for many years. I certainly wish to improve how I use all aspects of the project including Lucene. Learning is never finished. My large comments in the code are there to draw attention to what I already knew was an iffy solution. That said I will note that for all test cases at the LOC (the solr submission has over 100 unit tests and the LOC's original dev branch has another 50 or so that dealt with their specific configurations) what I've built does work to their satisfaction, so let's try to be cordial and make things better. Improvements and review by others is one of the major benefits of community process. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org