[
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061558#comment-14061558
]
Robert Muir commented on LUCENE-5057:
-------------------------------------
Then that's something broken with that parser. Please open a separate issue for
that!
There is nothing wrong with this analysis component.
> Hunspell stemmer generates multiple tokens
> ------------------------------------------
>
> Key: LUCENE-5057
> URL: https://issues.apache.org/jira/browse/LUCENE-5057
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.3
> Reporter: Luca Cavanna
> Assignee: Adrien Grand
>
> The hunspell stemmer seems to be generating multiple tokens: the original
> token plus the available stems.
> It might be a good thing in some cases but it seems to be a different
> behaviour compared to the other stemmers and causes problems as well. I would
> rather have an option to decide whether it should output only the available
> stems, or the stems plus the original token. I'm not sure though if it's
> possible to have only a single stem indexed, which would be even better in my
> opinion. When I look at how Snowball works, only one token is indexed: the
> stem, and that works great. Probably there's something I'm missing in how
> hunspell works.
> Here is my issue: I have a query composed of multiple terms, which is
> analyzed using stemming, and a boolean query is generated out of it. All is
> fine when adding all clauses as SHOULD (OR operator), but if I add all clauses
> as MUST (AND operator), then I get back only the documents that contain the
> stem derived from exactly the same original word.
> Example for the Dutch language I'm working with: fiets (means bicycle in
> Dutch), its plural is fietsen.
> If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index
> "fiets" I get only "fiets" indexed.
> When I query for "fietsen whatever" I get the following boolean query:
> field:fiets field:fietsen field:whatever.
> If I apply the AND operator and use must clauses for each subquery, then I
> can only find the documents that originally contained "fietsen", not the ones
> that originally contained "fiets", which is not really what stemming is about.
> Any thoughts on this? I also wonder if it can be a dictionary issue since I
> see that different words that have the word "fiets" as root don't get the
> same stems, and using the AND operator at query time is a big issue.
> I would love to contribute to this and am looking forward to your feedback.
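
The matching behaviour described in the quoted report can be reproduced with a small toy model. This is only a sketch: it is plain Python, not Lucene code, and the `STEMS` table, the `analyze` and `matches` helpers, and the `keep_original` flag are hypothetical names invented for illustration. It assumes the analyzer emits the original token plus its stems, and that every analyzed query token becomes its own boolean clause, as in the query shown in the report.

```python
# Toy model of the behaviour in the report; not Lucene code.
# STEMS is a hypothetical two-entry dictionary used only for illustration.
STEMS = {"fietsen": ["fiets"], "fiets": ["fiets"]}


def analyze(text, keep_original=True):
    """Return the tokens for `text`: stems, optionally preceded by the original word."""
    tokens = []
    for word in text.lower().split():
        stems = STEMS.get(word, [word])
        candidates = ([word] + stems) if keep_original else stems
        for token in candidates:
            if token not in tokens:  # drop duplicates, keep first-seen order
                tokens.append(token)
    return tokens


def matches(query, doc, operator, keep_original=True):
    """Treat each analyzed query token as one clause: AND needs all, OR needs any."""
    doc_terms = set(analyze(doc, keep_original))
    hits = [term in doc_terms for term in analyze(query, keep_original)]
    return all(hits) if operator == "AND" else any(hits)


docs = ["fietsen whatever", "fiets whatever"]
query = "fietsen whatever"

# Original token plus stems: AND finds only the document that contained "fietsen".
and_hits = [d for d in docs if matches(query, d, "AND")]
# OR still finds both documents.
or_hits = [d for d in docs if matches(query, d, "OR")]
# Stems only (the option asked for in the report): AND finds both documents.
stem_only_and_hits = [d for d in docs if matches(query, d, "AND", keep_original=False)]

print(and_hits)
print(or_hits)
print(stem_only_and_hits)
```

In this model the AND query misses the "fiets" document only because the extra token "fietsen" is turned into a clause of its own that every match must satisfy. The sketch does not show whether the filter or the query parser is responsible in Lucene itself.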