Hi Tomoko,

Thanks for your answer.
Following your suggestion, I have opened an issue with a patch attached:
https://issues.apache.org/jira/browse/LUCENE-8937

Adrien

On Sun, Jul 28, 2019 at 13:51, Michael Sokolov <msoko...@gmail.com> wrote:

> Oh sorry for jumping in with my irrelevant comment, you are right, of
> course!
>
> On Sat, Jul 27, 2019, 10:36 PM Tomoko Uchida <tomoko.uchida.1...@gmail.com>
> wrote:
>
> > Let me just make things a bit clearer...
> > I think the concern here is that FrenchMinimalStemmer would remove the
> > last "digit" from a token because it does not check whether the
> > character is a letter or not.
> > e.g., "123455" is trimmed to "12345" by FrenchMinimalStemmer.
> >
> > To me, this behaviour is beyond stemming.
> >
> > Tomoko
> >
> > On Sun, Jul 28, 2019, 4:55 Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > > I'm not so sure. I think the whole idea of having both stemmers is
> > > that the minimal one does less than the light one.
> > >
> > > Removing the final character of a double letter suffix is going to
> > > sacrifice some precision. For example mes/mess, ne/née, I'm sure
> > > there are others.
> > >
> > > So having both options is helpful, I don't think it's a bug on the
> > > face of it. However I didn't look closely at the code, so I'm not
> > > sure what the intent is exactly.
> > >
> > > On Sat, Jul 27, 2019, 7:30 AM Tomoko Uchida
> > > <tomoko.uchida.1...@gmail.com> wrote:
> > >
> > > > Hi Adrien,
> > > >
> > > > To me, it simply sounds like a bug. Can you please open a JIRA
> > > > (with a patch if possible)?
> > > >
> > > > Tomoko
> > > >
> > > > On Tue, Jul 23, 2019, 22:05 Adrien Gallou <adriengal...@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm using both light and minimal French stemmers and encountered
> > > > > an issue when using the minimal stemmer.
> > > > >
> > > > > The light stemmer removes the last character of a word if the
> > > > > last two characters are identical.
> > > > > We can see that here:
> > > > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > > > > In this light stemmer, there is a check to avoid altering the
> > > > > token if the token is a number.
> > > > >
> > > > > The minimal stemmer also removes the last character of a word if
> > > > > the last two characters are identical.
> > > > > We can see that here:
> > > > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> > > > >
> > > > > But in this minimal stemmer there is no check to see if the
> > > > > character is a letter or not.
> > > > > So when we have numeric tokens with the last two characters
> > > > > identical they are altered.
> > > > >
> > > > > Is there a reason for this?
> > > > > Should I file an issue on Jira to add this check?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Adrien Gallou
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
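
For readers of the archive, here is a minimal sketch of the behaviour discussed in this thread. It is not the Lucene source and not the patch attached to LUCENE-8937: the class name DoubledSuffixSketch and both method names are made up for illustration, and it isolates only the "last two characters are identical" rule, assuming the guard would be a Character.isLetter check on the final character.

```java
// Hypothetical illustration only; not FrenchMinimalStemmer and not the LUCENE-8937 patch.
final class DoubledSuffixSketch {

  // Unguarded rule as described in the thread: drop the last character
  // whenever the last two characters are identical, letter or not.
  // Returns the new logical length of the token.
  static int trimUnchecked(char[] s, int len) {
    if (len >= 2 && s[len - 1] == s[len - 2]) {
      return len - 1;
    }
    return len;
  }

  // Guarded rule (assumed shape of the fix): drop the last character
  // only when it is a letter, so numeric tokens are left untouched.
  static int trimLettersOnly(char[] s, int len) {
    if (len >= 2 && s[len - 1] == s[len - 2] && Character.isLetter(s[len - 1])) {
      return len - 1;
    }
    return len;
  }

  public static void main(String[] args) {
    // "123455" is the numeric example from the thread; "mess" is a word with a doubled final letter.
    for (String token : new String[] {"123455", "mess"}) {
      char[] chars = token.toCharArray();
      String unchecked = new String(chars, 0, trimUnchecked(chars, chars.length));
      String guarded = new String(chars, 0, trimLettersOnly(chars, chars.length));
      System.out.println(token + " -> unchecked: " + unchecked + ", letters only: " + guarded);
    }
  }
}
```

With the unguarded rule "123455" becomes "12345", which is the case Tomoko describes; with the letter check the numeric token keeps all six characters, while "mess" is still reduced to "mes" by both variants.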