[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luca Cavanna updated LUCENE-5057:
---------------------------------
Summary: Hunspell stemmer generates multiple tokens (was: Hunspell stemmer
generates multiple tokens (original + stems))
> Hunspell stemmer generates multiple tokens
> ------------------------------------------
>
> Key: LUCENE-5057
> URL: https://issues.apache.org/jira/browse/LUCENE-5057
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.3
> Reporter: Luca Cavanna
>
> The Hunspell stemmer seems to generate multiple tokens: the original
> token plus the available stems.
> This might be a good thing in some cases, but it differs from the behaviour
> of the other stemmers and causes problems as well. I would rather have an
> option to decide whether it should output only the available stems, or the
> stems plus the original token.
> Here is my issue: I have a query composed of multiple terms, which is
> analyzed using stemming, and a boolean query is generated out of it. All is
> fine when adding all clauses as should (OR operator), but if I add all
> clauses as must (AND operator), then I get back only the documents that
> contain the stem produced from exactly the same original word.
> Example for the Dutch language I'm working with: fiets (means bicycle in
> Dutch), whose plural is fietsen.
> If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index
> "fiets" I get only "fiets" indexed.
> When I query for "fietsen whatever" I get the following boolean query:
> field:fiets field:fietsen field:whatever.
> If I apply the AND operator and use must clauses for each subquery, then I
> can only find the documents that originally contained "fietsen", not the ones
> that originally contained "fiets", which is not really what stemming is about.
> Any thoughts on this? I can work on a patch; I'd just need some help
> deciding the name of the option and what the default behaviour should be.
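
The fiets/fietsen case described above can be reproduced with a small toy
model. This is only a sketch and does not use the Lucene API: the STEMS table,
the analyze/index/search_and helpers and the keep_original flag are all
hypothetical names made up for illustration, with keep_original standing in
for the kind of option the report asks for.

```python
# Toy model only, not Lucene code. STEMS is a hypothetical stand-in for a
# Hunspell dictionary lookup; it knows a single Dutch plural.
STEMS = {"fietsen": ["fiets"]}


def analyze(token, keep_original=True):
    """Return the terms emitted for one input token."""
    stems = STEMS.get(token, [])
    if not stems:
        # No stem available: the token passes through unchanged.
        return [token]
    # keep_original=True mimics the reported behaviour (original + stems).
    return ([token] if keep_original else []) + stems


def index(docs, keep_original=True):
    """Map each document id to the set of terms indexed for it."""
    return {
        doc_id: {t for tok in text.split() for t in analyze(tok, keep_original)}
        for doc_id, text in docs.items()
    }


def search_and(query, indexed, keep_original=True):
    """Treat every term emitted for the query as a must (AND) clause."""
    terms = [t for tok in query.split() for t in analyze(tok, keep_original)]
    return sorted(
        doc_id
        for doc_id, doc_terms in indexed.items()
        if all(t in doc_terms for t in terms)
    )


docs = {1: "fietsen whatever", 2: "fiets whatever"}

# Original + stems: the query requires "fietsen", so document 2 is lost.
with_original = search_and("fietsen whatever", index(docs, True), True)

# Stems only: both documents reduce to "fiets" and both match.
stems_only = search_and("fietsen whatever", index(docs, False), False)

print(with_original, stems_only)
```

In this model the must query only finds document 1 when the original token is
emitted alongside its stem, and finds both documents when only the stem is
emitted, which is the difference the proposed option would control.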