Luca Cavanna created LUCENE-5057:
------------------------------------
Summary: Hunspell stemmer generates multiple tokens (original + stems)
Key: LUCENE-5057
URL: https://issues.apache.org/jira/browse/LUCENE-5057
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna
The hunspell stemmer seems to be generating multiple tokens: the original token
plus the available stems.
It might be a good thing in some cases but it seems to be a different behaviour
compared to the other stemmers and causes problems as well. I would rather have
an option to decide whether it should output only the available stems, or the
stems plus the original token.
Here is my issue: I have a multi-term query that is analyzed with stemming, and
a boolean query is generated from the resulting tokens. Everything works when
all clauses are added as should (OR operator), but when I add all clauses as
must (AND operator), I only get back the documents whose stem was produced from
exactly the same original word as in the query.
Example for the Dutch language I'm working with: fiets (which means bicycle in
Dutch) has the plural fietsen.
If I index "fietsen", both "fietsen" and "fiets" are indexed, but if I index
"fiets", only "fiets" is indexed.
When I query for "fietsen whatever" I get the following boolean query:
field:fiets field:fietsen field:whatever.
If I apply the AND operator and use must clauses for each subquery, then I can
only find the documents that originally contained "fietsen", not the ones that
originally contained "fiets", which is not really what stemming is about.
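
The mismatch described above can be reproduced with a small toy model. This is
only a sketch: it does not use Lucene or Hunspell, and the STEMS table and the
analyze, matches_all and matches_any helpers are hypothetical names invented
for illustration. It assumes the stemmer emits the original token followed by
any stems that differ from it.

```python
# Toy model only: STEMS stands in for a Hunspell dictionary lookup.
STEMS = {"fietsen": ["fiets"], "fiets": ["fiets"]}


def analyze(text):
    """Emit each original token followed by any stems that differ from it."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)
        for stem in STEMS.get(word, []):
            if stem != word:
                tokens.append(stem)
    return tokens


def matches_all(query, doc):
    """AND semantics: every analyzed query term must occur in the document."""
    return set(analyze(query)) <= set(analyze(doc))


def matches_any(query, doc):
    """OR semantics: at least one analyzed query term occurs in the document."""
    return bool(set(analyze(query)) & set(analyze(doc)))


if __name__ == "__main__":
    query = "fietsen whatever"
    for doc in ("fietsen whatever", "fiets whatever"):
        print(doc, "| AND:", matches_all(query, doc), "| OR:", matches_any(query, doc))
```

In this model the query "fietsen whatever" requires the term "fietsen" under
AND semantics, and a document that contained only "fiets" never has that term
indexed, so it is rejected even though it shares the stem.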
Any thoughts on this? I would work out a patch, I'd just need some help
deciding the name of the option and what the default behaviour should be.
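
To make the proposal concrete, here is a rough sketch of how such an option
could behave, again as a toy model rather than Lucene code. The option name
stems_only, the STEMS table and the helper functions are all hypothetical, and
passing unknown words through unchanged is an assumption of this sketch.

```python
# Toy model only: STEMS stands in for a Hunspell dictionary lookup.
STEMS = {"fietsen": ["fiets"], "fiets": ["fiets"]}


def analyze(text, stems_only=False):
    """Return the tokens for text.

    stems_only=False: original token plus its stems (the behaviour reported here).
    stems_only=True:  only the stems, as other stemmers do.
    """
    tokens = []
    for word in text.lower().split():
        stems = STEMS.get(word, [])
        if not stems:
            # Assumption: words without a known stem pass through unchanged.
            tokens.append(word)
            continue
        if not stems_only and word not in stems:
            tokens.append(word)
        tokens.extend(stems)
    return tokens


def matches_all(query, doc, stems_only=False):
    """AND semantics: every analyzed query term must occur in the document."""
    query_terms = set(analyze(query, stems_only))
    doc_terms = set(analyze(doc, stems_only))
    return query_terms <= doc_terms


if __name__ == "__main__":
    query = "fietsen whatever"
    for flag in (False, True):
        print("stems_only =", flag, "->", matches_all(query, "fiets whatever", flag))
```

With stems_only enabled on both the index and query side, "fietsen" and
"fiets" reduce to the same single term in this model, so the AND query matches
documents containing either form.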