[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

Luca Cavanna (JIRA) Fri, 14 Jun 2013 06:33:24 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Luca Cavanna updated LUCENE-5057:
---------------------------------

    Description: 
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed.

When I look at how snowball works, only one token is indexed, the stem, and 
that works great.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: fiets (means bicycle in 
dutch), its plural is fietsen.

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get the only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.

  was:
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: fiets (means bicycle in 
dutch), its plural is fietsen.

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get the only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.

    
> Hunspell stemmer generates multiple tokens
> ------------------------------------------
>
>                 Key: LUCENE-5057
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5057
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.3
>            Reporter: Luca Cavanna
>
> The hunspell stemmer seems to be generating multiple tokens: the original 
> token plus the available stems.
> It might be a good thing in some cases but it seems to be a different 
> behaviour compared to the other stemmers and causes problems as well. I would 
> rather have an option to decide whether it should output only the available 
> stems, or the stems plus the original token. I'm not sure though if it's 
> possible to have only a single stem indexed.
> When I look at how snowball works, only one token is indexed, the stem, and 
> that works great.
> Here is my issue: I have a query composed of multiple terms, which is 
> analyzed using stemming and a boolean query is generated out of it. All fine 
> when adding all clauses as should (OR operator), but if I add all clauses as 
> must (AND operator), then I can get back only the documents that contain the 
> stem originated by the exactly same original word.
> Example for the dutch language I'm working with: fiets (means bicycle in 
> dutch), its plural is fietsen.
> If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
> "fiets" I get the only "fiets" indexed.
> When I query for "fietsen whatever" I get the following boolean query: 
> field:fiets field:fietsen field:whatever.
> If I apply the AND operator and use must clauses for each subquery, then I 
> can only find the documents that originally contained "fietsen", not the ones 
> that originally contained "fiets", which is not really what stemming is about.
> Any thoughts on this? I would work out a patch, I'd just need some help 
> deciding the name of the option and what the default behaviour should be.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

Reply via email to