[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713487#comment-13713487 ]
Luca Cavanna commented on LUCENE-5057:
--------------------------------------

Thanks Adrien for looking into this, nice explanation!

> Hunspell stemmer generates multiple tokens
> ------------------------------------------
>
>                 Key: LUCENE-5057
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5057
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.3
>            Reporter: Luca Cavanna
>            Assignee: Adrien Grand
>
> The Hunspell stemmer seems to generate multiple tokens: the original
> token plus the available stems.
> That might be a good thing in some cases, but it differs from the
> behaviour of the other stemmers and causes problems as well. I would
> rather have an option to decide whether it should output only the available
> stems, or the stems plus the original token. I'm not sure, though, whether
> it's possible to have only a single stem indexed, which would be even better
> in my opinion. When I look at how Snowball works, only one token is indexed,
> the stem, and that works great. There is probably something I'm missing in
> how Hunspell works.
> Here is my issue: I have a query composed of multiple terms, which is
> analyzed using stemming, and a boolean query is generated out of it. All is
> fine when adding all clauses as should (OR operator), but if I add all
> clauses as must (AND operator), then I get back only the documents that
> contain the stem derived from the exact same original word.
> Example for the Dutch language I'm working with: fiets (means bicycle in
> Dutch), its plural is fietsen.
> If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index
> "fiets" I get only "fiets" indexed.
> When I query for "fietsen whatever" I get the following boolean query:
> field:fiets field:fietsen field:whatever.
> If I apply the AND operator and use must clauses for each subquery, then I
> can only find the documents that originally contained "fietsen", not the ones
> that originally contained "fiets", which is not really what stemming is about.
> Any thoughts on this? I also wonder whether it could be a dictionary issue,
> since I see that different words that have "fiets" as their root don't get
> the same stems, and using the AND operator at query time is a big issue.
> I would love to contribute to this and am looking forward to your feedback.
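
The mismatch described in the issue can be reproduced with a small toy model.
The sketch below is not Lucene code and does not use the Hunspell filter: the
STEMS table and the functions analyze, build_index and search_all are
hypothetical names made up for illustration, and the keep_original flag only
models the option requested in the issue, not an existing setting.

```python
from collections import defaultdict

# Toy stem table standing in for a Hunspell dictionary (hypothetical data).
STEMS = {"fietsen": ["fiets"]}


def analyze(text, keep_original=True):
    """Return the list of terms emitted for each token in the text."""
    positions = []
    for token in text.lower().split():
        stems = STEMS.get(token, [])
        if keep_original or not stems:
            # Original token plus any stems that differ from it.
            terms = [token] + [s for s in stems if s != token]
        else:
            # Stems only, as the other stemmers would do.
            terms = list(stems)
        positions.append(terms)
    return positions


def build_index(docs, keep_original=True):
    """Build a term -> set of document ids mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for terms in analyze(text, keep_original):
            for term in terms:
                index[term].add(doc_id)
    return index


def search_all(index, query, all_ids, keep_original=True):
    """AND query: every term produced by analysis is a must clause."""
    result = set(all_ids)
    for terms in analyze(query, keep_original):
        for term in terms:
            result &= index.get(term, set())
    return result


docs = {1: "fietsen whatever", 2: "fiets whatever"}

# Original token kept: the query requires "fietsen", so doc 2 is lost.
index = build_index(docs, keep_original=True)
print(sorted(search_all(index, "fietsen whatever", docs, keep_original=True)))
# prints [1]

# Stems only: both documents match the same query.
index = build_index(docs, keep_original=False)
print(sorted(search_all(index, "fietsen whatever", docs, keep_original=False)))
# prints [1, 2]
```

With the original token kept, "fietsen" becomes a must clause that only
documents containing the plural can satisfy. Emitting stems only makes the
index side and the query side agree on "fiets", which is the behaviour the
issue asks for.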