[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luca Cavanna updated LUCENE-5057:
---------------------------------
    Summary: Hunspell stemmer generates multiple tokens  (was: Hunspell stemmer generates multiple tokens (original + stems))

> Hunspell stemmer generates multiple tokens
> ------------------------------------------
>
>                 Key: LUCENE-5057
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5057
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.3
>            Reporter: Luca Cavanna
>
> The Hunspell stemmer seems to generate multiple tokens: the original
> token plus the available stems.
> That might be useful in some cases, but it seems to differ from the
> behaviour of the other stemmers, and it causes problems as well. I would
> rather have an option to decide whether it should output only the available
> stems, or the stems plus the original token.
> Here is my issue: I have a query composed of multiple terms, which is
> analyzed using stemming, and a boolean query is generated out of it. All is
> fine when adding all clauses as should (OR operator), but if I add all clauses
> as must (AND operator), then I can get back only the documents that contain
> the stem produced by exactly the same original word.
> Example for the Dutch language I'm working with: fiets (means bicycle in
> Dutch), its plural is fietsen.
> If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index
> "fiets" I get only "fiets" indexed.
> When I query for "fietsen whatever" I get the following boolean query:
> field:fiets field:fietsen field:whatever.
> If I apply the AND operator and use must clauses for each subquery, then I
> can only find the documents that originally contained "fietsen", not the ones
> that originally contained "fiets", which is not really what stemming is about.
> Any thoughts on this? I would work out a patch, I'd just need some help
> deciding the name of the option and what the default behaviour should be.
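The behaviour described in the issue can be reproduced with a small toy model. The sketch below does not use the Lucene API: the stem table, the `analyze` and `search` helpers, and the `keep_original` flag are all hypothetical names introduced for illustration (the issue asks for such an option but does not name it). It only assumes what the report states, namely that "fietsen" is emitted as "fietsen" plus "fiets", that "fiets" is emitted as "fiets" alone, and that every emitted term becomes its own clause.

```python
# Toy stand-in for the Dutch Hunspell dictionary (assumption, not real data).
STEMS = {"fietsen": ["fiets"]}


def analyze(text, keep_original):
    """Return the set of terms emitted for a whitespace-tokenized text.

    keep_original=True mimics the reported behaviour (original + stems);
    keep_original=False mimics the requested option (stems only).
    """
    terms = set()
    for token in text.split():
        stems = STEMS.get(token, [])
        if not stems:
            # No stem available: the token passes through unchanged.
            terms.add(token)
            continue
        if keep_original:
            terms.add(token)
        terms.update(stems)
    return terms


def search(query, docs, keep_original, operator):
    """Match docs against a query where each emitted term is one clause."""
    clauses = analyze(query, keep_original)
    combine = all if operator == "AND" else any
    return [
        doc_id
        for doc_id, text in docs.items()
        if combine(term in analyze(text, keep_original) for term in clauses)
    ]


docs = {"doc1": "fietsen whatever", "doc2": "fiets whatever"}

# Reported behaviour: the AND query requires the literal term "fietsen".
print(search("fietsen whatever", docs, keep_original=True, operator="AND"))
# Same analysis with OR clauses.
print(search("fietsen whatever", docs, keep_original=True, operator="OR"))
# Stems-only analysis with AND clauses.
print(search("fietsen whatever", docs, keep_original=False, operator="AND"))
```

In this model the first query returns only `doc1`, because the clause for the unstemmed term "fietsen" can never match a document that contained "fiets". The other two queries return both documents, which is the outcome one would expect from stemming.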