Adrien Gallou created LUCENE-8937:
-------------------------------------

             Summary: Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
                 Key: LUCENE-8937
                 URL: https://issues.apache.org/jira/browse/LUCENE-8937
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Adrien Gallou

Here is the discussion on the mailing list: [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]

The light stemmer removes the last character of a word if the last two characters are identical. We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263

In the light stemmer, there is a check that avoids altering the token if the token is a number.

The minimal stemmer also removes the last character of a word if the last two characters are identical. We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77

But the minimal stemmer has no check for whether that character is a letter. So numeric tokens whose last two characters are identical are altered. For example, "1234567899" is stemmed to "123456789".

It would be better if such tokens were left unaltered.
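
To make the proposal concrete, here is a minimal sketch of the trailing-duplicate rule with a letter check added. It is not the Lucene source: the class TrailingDuplicateSketch and the methods stripTrailingDuplicate and apply are hypothetical names used only for illustration, and the other suffix rules of the minimal stemmer are left out. It assumes that guarding the comparison with Character.isLetter is an acceptable way to skip numeric tokens.

```java
// Hypothetical standalone sketch; not the actual Lucene source.
class TrailingDuplicateSketch {

    // Drops the last character when the final two characters are identical,
    // but only if that character is a letter. Returns the new length.
    static int stripTrailingDuplicate(char[] s, int len) {
        if (len >= 2 && s[len - 1] == s[len - 2] && Character.isLetter(s[len - 1])) {
            return len - 1;
        }
        return len;
    }

    // Convenience wrapper for trying the rule on a String.
    static String apply(String token) {
        char[] buffer = token.toCharArray();
        int newLen = stripTrailingDuplicate(buffer, buffer.length);
        return new String(buffer, 0, newLen);
    }

    public static void main(String[] args) {
        System.out.println(apply("1234567899")); // digits: left unchanged
        System.out.println(apply("baronn"));     // letters: trailing duplicate dropped
    }
}
```

With this guard, "1234567899" keeps both trailing digits, while a token ending in a doubled letter is still shortened by one character.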