Luca Cavanna created LUCENE-5057:
------------------------------------

             Summary: Hunspell stemmer generates multiple tokens (original + stems)
                 Key: LUCENE-5057
                 URL: https://issues.apache.org/jira/browse/LUCENE-5057
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 4.3
            Reporter: Luca Cavanna

The Hunspell stemmer seems to generate multiple tokens: the original token plus the available stems. That might be useful in some cases, but it differs from the behaviour of the other stemmers and also causes problems. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming, and a boolean query is generated from it. All is fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), I only get back the documents that contain the stem derived from exactly the same original word.

Example for the Dutch language I'm working with: fiets (which means bicycle in Dutch), whose plural is fietsen. If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index "fiets" I get only "fiets" indexed. When I query for "fietsen whatever" I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained "fietsen", not the ones that originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I would work out a patch; I'd just need some help deciding the name of the option and what the default behaviour should be.
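
To make the fiets/fietsen example concrete, here is a minimal sketch in plain Java that models the behaviour described in this issue. It does not use the Lucene API: the one-entry stem table, the class and method names, and the `keepOriginal` flag are all hypothetical and only illustrate the proposed option, whose real name and default are still undecided.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the reported behaviour; NOT the Lucene Hunspell implementation.
class HunspellTokenSketch {

    // Hypothetical one-entry stem table standing in for a Dutch Hunspell dictionary.
    static final Map<String, String> STEMS = Map.of("fietsen", "fiets");

    // Returns the terms produced for a text. When keepOriginal is true, a token
    // that has a different stem is emitted together with that stem (the behaviour
    // reported here); when false, only the stem is emitted.
    static Set<String> analyze(String text, boolean keepOriginal) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            String stem = STEMS.get(token);
            if (stem == null) {
                terms.add(token);          // no stem known: keep the token as-is
            } else {
                if (keepOriginal) {
                    terms.add(token);      // original surface form
                }
                terms.add(stem);           // stem
            }
        }
        return terms;
    }

    // All query terms as must clauses (AND operator).
    static boolean matchesAll(Set<String> docTerms, Set<String> queryTerms) {
        return docTerms.containsAll(queryTerms);
    }

    // All query terms as should clauses (OR operator).
    static boolean matchesAny(Set<String> docTerms, Set<String> queryTerms) {
        return !Collections.disjoint(docTerms, queryTerms);
    }

    public static void main(String[] args) {
        for (boolean keepOriginal : new boolean[] {true, false}) {
            Set<String> docPlural = analyze("fietsen whatever", keepOriginal);
            Set<String> docSingular = analyze("fiets whatever", keepOriginal);
            Set<String> query = analyze("fietsen whatever", keepOriginal);

            System.out.println("keepOriginal=" + keepOriginal + " query terms=" + query);
            System.out.println("  AND matches 'fietsen' doc: " + matchesAll(docPlural, query));
            System.out.println("  AND matches 'fiets' doc:   " + matchesAll(docSingular, query));
            System.out.println("  OR  matches 'fiets' doc:   " + matchesAny(docSingular, query));
        }
    }
}
```

In this model, with `keepOriginal` set to true the AND query no longer matches the document that contained only "fiets", while the OR query still does; with it set to false, both documents match under AND.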