Luca Cavanna created LUCENE-5057:
------------------------------------

             Summary: Hunspell stemmer generates multiple tokens (original + stems)
                 Key: LUCENE-5057
                 URL: https://issues.apache.org/jira/browse/LUCENE-5057
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 4.3
            Reporter: Luca Cavanna


The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

This may be useful in some cases, but it differs from the behaviour of the 
other stemmers and causes problems as well. I would rather have an option to 
decide whether it should output only the available stems, or the stems plus 
the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming, and a boolean query is generated from it. Everything works 
when all clauses are added as should (OR operator), but if I add all clauses 
as must (AND operator), I only get back the documents that originally 
contained exactly the same word form as the query.

An example from Dutch, the language I'm working with: fiets (meaning bicycle) 
has the plural fietsen.

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.
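
The indexing behaviour described above can be sketched with a toy model. This
is not Lucene's API: TOY_STEMS is a hypothetical stand-in for a Hunspell
dictionary lookup, and analyze is an illustrative helper that emits the
original token followed by its stems, skipping duplicates.

```python
# Hypothetical stand-in for a Hunspell dictionary lookup (not Lucene code).
TOY_STEMS = {"fietsen": ["fiets"], "fiets": ["fiets"]}


def analyze(token):
    """Emit the original token followed by its stems, without duplicates."""
    output = [token]
    for stem in TOY_STEMS.get(token, []):
        if stem not in output:
            output.append(stem)
    return output


print(analyze("fietsen"))  # ['fietsen', 'fiets']
print(analyze("fiets"))    # ['fiets']
```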

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.
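
A minimal sketch of the mismatch, again with a toy model rather than Lucene
itself. TOY_STEMS, analyze and search are hypothetical names, and the two
documents are made up for the example. Every analyzed query term becomes one
clause: with must clauses all terms have to be present in a document, with
should clauses any one of them is enough.

```python
# Hypothetical stand-in for a Hunspell dictionary lookup (not Lucene code).
TOY_STEMS = {"fietsen": ["fiets"], "fiets": ["fiets"]}


def analyze(text):
    """Emit each token plus its stems, in order, without duplicates."""
    terms = []
    for token in text.lower().split():
        for term in [token] + TOY_STEMS.get(token, []):
            if term not in terms:
                terms.append(term)
    return terms


# Two made-up documents, indexed with the same analysis as the query.
docs = {1: "fietsen whatever", 2: "fiets whatever"}
index = {doc_id: set(analyze(text)) for doc_id, text in docs.items()}


def search(query, require_all):
    """require_all=True models must (AND) clauses, False models should (OR)."""
    terms = analyze(query)
    match = all if require_all else any
    return sorted(
        doc_id
        for doc_id, indexed in index.items()
        if match(term in indexed for term in terms)
    )


print(search("fietsen whatever", require_all=False))  # [1, 2]
print(search("fietsen whatever", require_all=True))   # [1]
```

Document 2 is lost under AND because the query contributes the term "fietsen",
which was never indexed for a document that only contained "fiets".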

Any thoughts on this? I am willing to work on a patch; I'd just need some help 
deciding the name of the option and what the default behaviour should be.
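
One way the requested option could behave, sketched as a toy model. The flag
name keep_original is purely hypothetical (the name is exactly what is up for
discussion), TOY_STEMS stands in for the dictionary lookup, and falling back
to the original token when no stem is found is an assumption of this sketch,
not existing Lucene behaviour.

```python
# Hypothetical stand-in for a Hunspell dictionary lookup (not Lucene code).
TOY_STEMS = {"fietsen": ["fiets"], "fiets": ["fiets"]}


def analyze(text, keep_original=True):
    """Emit stems, plus the original token when keep_original is set.

    Assumption: a token with no known stem is always emitted unchanged,
    so unknown words are never dropped.
    """
    terms = []
    for token in text.lower().split():
        stems = TOY_STEMS.get(token, [])
        if keep_original or not stems:
            candidates = [token] + stems
        else:
            candidates = stems
        for term in candidates:
            if term not in terms:
                terms.append(term)
    return terms


def must_match(query, document, keep_original):
    """True if every analyzed query term occurs in the analyzed document."""
    query_terms = set(analyze(query, keep_original))
    document_terms = set(analyze(document, keep_original))
    return query_terms <= document_terms


print(analyze("fietsen whatever", keep_original=False))  # ['fiets', 'whatever']
print(must_match("fietsen whatever", "fiets whatever", keep_original=True))   # False
print(must_match("fietsen whatever", "fiets whatever", keep_original=False))  # True
```

With stems only, both word forms reduce to the same term, so an AND query for
"fietsen whatever" also finds documents that contained "fiets".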

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
