[jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens

Lukas Vlcek (JIRA) Thu, 05 Sep 2013 20:06:32 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759805#comment-13759805
 ]


Lukas Vlcek commented on LUCENE-5057:
-------------------------------------

Exactly, Chris, that is what I tried to explain with that example. I also tried 
to provide some kind of generalization. If you look into the Czech dictionary 
file, you will see that this example is something of an edge case, given that 
the words are only three letters long; the article that I wrote contains a 
different example with tokens of four and five letters.

But the point is (and the ticket title might be misleading in this context) 
that because the Hunspell token filter generates multiple tokens, some Lucene 
queries probably do not work correctly when the AND operator is used. At least 
that is how I understand the situation. So we need to either open a new Lucene 
ticket for this issue or reopen this one. What do you think?
                
> Hunspell stemmer generates multiple tokens
> ------------------------------------------
>
>                 Key: LUCENE-5057
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5057
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.3
>            Reporter: Luca Cavanna
>            Assignee: Adrien Grand
>
> The hunspell stemmer seems to be generating multiple tokens: the original 
> token plus the available stems.
> It might be a good thing in some cases but it seems to be a different 
> behaviour compared to the other stemmers and causes problems as well. I would 
> rather have an option to decide whether it should output only the available 
> stems, or the stems plus the original token. I'm not sure though if it's 
> possible to have only a single stem indexed, which would be even better in my 
> opinion. When I look at how snowball works only one token is indexed, the 
> stem, and that works great. Probably there's something I'm missing in how 
> hunspell works.
> Here is my issue: I have a query composed of multiple terms, which is 
> analyzed using stemming, and a boolean query is generated out of it. All is 
> fine when adding all clauses as should (OR operator), but if I add all 
> clauses as must (AND operator), then I get back only the documents that 
> contain the stem produced from exactly the same original word.
> Example for the Dutch language I'm working with: fiets (means bicycle in 
> Dutch), its plural is fietsen.
> If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
> "fiets" I get only "fiets" indexed.
> When I query for "fietsen whatever" I get the following boolean query: 
> field:fiets field:fietsen field:whatever.
> If I apply the AND operator and use must clauses for each subquery, then I 
> can only find the documents that originally contained "fietsen", not the ones 
> that originally contained "fiets", which is not really what stemming is about.
> Any thoughts on this? I also wonder if it can be a dictionary issue since I 
> see that different words that have the word "fiets" as root don't get the 
> same stems, and using the AND operator at query time is a big issue.
> I would love to contribute on this and am looking forward to your feedback.
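
The behaviour described in the issue can be reproduced outside Lucene with a toy inverted index. The following is only a minimal sketch under stated assumptions, not Lucene code: the `STEMS` table, `analyze`, `build_index`, `search_and` and `search_or` are hypothetical names made up for illustration, and the analyzer simply emits the original token plus its stems, which is the behaviour the reporter describes.

```python
# Hypothetical stem table standing in for a Hunspell dictionary lookup.
STEMS = {"fietsen": ["fiets"], "fiets": []}


def analyze(text):
    """Emit every token followed by its stems (original + stems)."""
    terms = []
    for token in text.lower().split():
        terms.append(token)
        terms.extend(STEMS.get(token, []))
    return terms


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search_and(index, query):
    """Treat every analyzed query term as a MUST clause."""
    result = None
    for term in analyze(query):
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()


def search_or(index, query):
    """Treat every analyzed query term as a SHOULD clause."""
    result = set()
    for term in analyze(query):
        result |= index.get(term, set())
    return result


docs = {1: "fietsen whatever", 2: "fiets whatever"}
index = build_index(docs)

print(search_and(index, "fietsen whatever"))  # → {1}
print(search_or(index, "fietsen whatever"))   # → {1, 2}
print(search_and(index, "fiets whatever"))    # → {1, 2}
```

With must clauses, the query "fietsen whatever" requires the term "fietsen", which was only indexed for document 1, so document 2 (which contained "fiets") is missed even though both share the stem. Emitting only the stem on both the index and query side would make the two documents match the same terms.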

