Re: Hunspell stemmer generates multiple tokens

oren bochman Fri, 07 Jun 2013 17:01:24 -0700

Emitting multiple tokens seems to be the more flexible contract.

You might want to match on just the stem, on both the exact token and the
stemmed token, or on just the exact term. Putting both in the index may
therefore be expedient, depending on the language.
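
To make the three matching modes concrete, here is a toy sketch in plain Python. It is not Lucene's or Hunspell's API: the TOY_STEMS table, the function names and the "exact"/"stem"/"either" modes are all made up for illustration. It only assumes that the analyzer emits the original word plus its stems at the same position and that the index keeps the two kinds of token apart.

```python
from collections import defaultdict

# Hypothetical stem table standing in for a real dictionary-based stemmer.
TOY_STEMS = {
    "running": ["run"],
    "runs": ["run"],
    "saw": ["saw", "see"],  # ambiguous word: two candidate stems
}


def analyze(text):
    """Yield (position, term, kind); stems share the position of the original word."""
    for pos, word in enumerate(text.lower().split()):
        yield pos, word, "exact"
        for stem in TOY_STEMS.get(word, [word]):
            yield pos, stem, "stem"


def build_index(docs):
    """Build a toy inverted index mapping (kind, term) to a set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _pos, term, kind in analyze(text):
            index[(kind, term)].add(doc_id)
    return index


def search(index, word, mode):
    """Look up one query word using mode 'exact', 'stem' or 'either'."""
    word = word.lower()
    exact_hits = set(index.get(("exact", word), set()))
    stem_hits = set()
    for stem in TOY_STEMS.get(word, [word]):
        stem_hits |= index.get(("stem", stem), set())
    if mode == "exact":
        return exact_hits
    if mode == "stem":
        return stem_hits
    if mode == "either":
        return exact_hits | stem_hits
    raise ValueError("unknown mode: " + mode)


docs = {1: "she runs daily", 2: "running shoes", 3: "a long run"}
index = build_index(docs)
```

With this index, search(index, "runs", "exact") finds only document 1, while the "stem" mode also finds documents 2 and 3. Having both kinds of token in the index is what lets the choice be made at query time.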


Also, there are a number of common situations where document text can be
stemmed more accurately than query text. In such cases you might want to boost
the stemmed token adaptively.
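
One possible reading of "adaptively", again as a toy Python sketch rather than anything Lucene provides: give an exact match a fixed weight and scale the boost of each stem match by how ambiguous the query word is. The heuristic (boost divided by the number of candidate stems), the weights and the function names are my own assumptions for illustration.

```python
def stem_boost(candidate_stems, base=1.0):
    """Hypothetical heuristic: the fewer candidate stems, the higher the boost per stem."""
    return base / max(1, len(candidate_stems))


def score(doc_terms, query_word, query_stems, exact_weight=2.0):
    """Score one document for one query word.

    doc_terms is a dict with two sets of terms for the document:
    'exact' (original tokens) and 'stem' (stemmed tokens).
    """
    total = 0.0
    if query_word in doc_terms["exact"]:
        total += exact_weight
    boost = stem_boost(query_stems)
    for stem in query_stems:
        if stem in doc_terms["stem"]:
            total += boost
    return total


# Toy documents, already analyzed into exact and stemmed terms.
doc_a = {"exact": {"saw"}, "stem": {"saw", "see"}}
doc_b = {"exact": {"sees"}, "stem": {"see"}}
```

Here the ambiguous query word "saw" (candidate stems "saw" and "see") contributes only half a point per stem match, whereas an unambiguous query word such as "sees" gets the full stem boost.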


On Jun 7, 2013, at 16:16, Luca Cavanna <cavannal...@gmail.com> wrote:

> Hi,
> I just noticed that the HunspellStemmer outputs more than one token: the
> original word plus the stems, as far as I understood.
> 
> This is not quite what I would expect and becomes tricky especially at
> query time. Using for instance elasticsearch to query a stemmed field, a
> boolean query would be generated, containing multiple clauses (one for each
> token generated by the stemmer) instead of just a clause with the stem that
> we expect to find in the index (if we indexed using stemming of course).
> 
> I would like to know if you think this is the correct behaviour and if this
> is something you are aware of. If I look at snowball for example, I see
> that only one token is generated.
> 
> 
> Thanks,
> Luca
