Mark Harwood created LUCENE-8876:
------------------------------------

             Summary: EnglishMinimalStemmer does not implement s-stemmer paper correctly?
                 Key: LUCENE-8876
                 URL: https://issues.apache.org/jira/browse/LUCENE-8876
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Mark Harwood


The EnglishMinimalStemmer fails to stem words ending in {{ees}}, such as bees, 
trees and employees.

The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] 
has this table of rules:

!https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!

The notes accompanying the table state:
{quote}"the first applicable rule encountered is the only one used"
{quote}
 

For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer 
misinterprets the rule logic: {{bees}} is not stemmed to {{bee}} and 
{{tomatoes}} is not stemmed to {{tomato}}. Words with the {{oes}} and {{ees}} 
suffixes are left intact.

"The first applicable rule" for {{ees}} could be read as rule 2 or rule 3 in 
the table, depending on whether you take {{applicable}} to mean "the THEN part 
of the rule has fired" or just "the suffix is referenced in the rule". 
EnglishMinimalStemmer assumes the latter, and I think it should be the former: 
for {{ees}} and {{oes}} we should fall through to rule 3 (remove any trailing 
S). That's certainly the conclusion I came to independently when testing on 
real data.
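
To make the "fall through" reading concrete, here is a minimal Java sketch. It 
is not the EnglishMinimalStemmer code and not a proposed patch. The class name 
{{SStemmerSketch}} is hypothetical, and the three rules are written as I 
understand the table in the paper (the image above), so treat them as an 
assumption. The minimum-length guard is also an assumption and is not part of 
the rule table.

```java
/** Hypothetical sketch of the s-stemmer rules; not Lucene's EnglishMinimalStemmer. */
final class SStemmerSketch {

  private SStemmerSketch() {}

  /**
   * Stems a lowercase word. A rule counts as "applicable" only when its whole
   * condition holds (suffix matches AND no exception matches); otherwise the
   * next rule is tried.
   */
  static String stem(String word) {
    int len = word.length();

    // Assumed guard: leave very short words alone (not in the rule table).
    if (len < 3 || !word.endsWith("s")) {
      return word;
    }

    // Rule 1 (as assumed): "ies" -> "y", unless the word ends in "eies" or "aies".
    if (word.endsWith("ies") && !word.endsWith("eies") && !word.endsWith("aies")) {
      return word.substring(0, len - 3) + "y";
    }

    // Rule 2 (as assumed): "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if (word.endsWith("es")
        && !word.endsWith("aes")
        && !word.endsWith("ees")
        && !word.endsWith("oes")) {
      return word.substring(0, len - 1);
    }

    // Rule 3 (as assumed): drop a trailing "s", unless the word ends in "us" or "ss".
    // Words excluded by rule 2 ("ees", "oes") reach this rule under the
    // fall-through reading.
    if (!word.endsWith("us") && !word.endsWith("ss")) {
      return word.substring(0, len - 1);
    }

    return word;
  }
}
```

With this reading, {{SStemmerSketch.stem("bees")}} returns {{bee}} and 
{{SStemmerSketch.stem("employees")}} returns {{employee}}. Note that rule 3 
only removes the trailing S, so {{tomatoes}} becomes {{tomatoe}} rather than 
{{tomato}}.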

There are some additional changes I'd like to see in a plural stemmer, but I 
won't list them here. The focus should be on making the code match the 
original paper it references.



