Question on stemming + synonyms and tokenizerFactory

Loren Fri, 14 Nov 2014 15:44:59 -0800

I have an analysis chain like this for some Spanish text:
standard asciifolding lowercase es_stop_filter es_stem_filter es_synonyms

With synonyms at the end, after all the other filters, I have to define my
synonyms in their stemmed, ASCII-folded, lowercase forms. So instead of
defining a synonym set like "vacuna, vacunación, inmunización", I have to
define it as "vacun, vacunacion, inmunizacion".
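
For illustration, here is a rough sketch of how a chain like this might be written as index settings, built as a Python dict and printed as JSON. The filter names come from the chain above; the analyzer name "es_text", the "_spanish_" stop word list, and the "light_spanish" stemmer are assumptions made for the example, not necessarily the exact setup in use.

```python
import json

# Hypothetical index settings mirroring the chain:
# standard asciifolding lowercase es_stop_filter es_stem_filter es_synonyms
settings = {
    "analysis": {
        "filter": {
            # Assumption: the built-in Spanish stop word list.
            "es_stop_filter": {"type": "stop", "stopwords": "_spanish_"},
            # Assumption: the light Spanish stemmer; the real chain may differ.
            "es_stem_filter": {"type": "stemmer", "language": "light_spanish"},
            # Synonyms sit last, so they are written in their stemmed,
            # ASCII-folded, lowercase forms.
            "es_synonyms": {
                "type": "synonym",
                "synonyms": ["vacun, vacunacion, inmunizacion"],
            },
        },
        "analyzer": {
            # "es_text" is a made-up analyzer name for this sketch.
            "es_text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "asciifolding",
                    "lowercase",
                    "es_stop_filter",
                    "es_stem_filter",
                    "es_synonyms",
                ],
            }
        },
    }
}

print(json.dumps(settings, indent=2))
```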

In the case of a very aggressive stemmer like Snowball for English, we
would have to define "intern, global" as a synonym mapping when we'd really
want to write "international, global".

This is a little counter-intuitive for the folks who define our synonyms,
as they think in dictionary terms rather than stemmed tokens. To see which
tokens to specify in the synonyms file, they need access to a
"standard asciifolding lowercase es_stop_filter es_stem_filter" analysis
chain, i.e. everything but the synonym filter.
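
One workaround we could imagine is a small preprocessing script that turns dictionary-form synonym lines into the normalized forms before they go into the synonyms file. The sketch below is only an approximation: ascii_fold and normalize_synonym_line are hypothetical helper names, the folding is a rough stand-in for the asciifolding filter, stop word removal is skipped, and the stemmer has to be supplied by the caller. The toy_stem function is a made-up placeholder, not a real Spanish stemmer.

```python
import unicodedata
from typing import Callable


def ascii_fold(text: str) -> str:
    """Strip combining accents; a rough approximation of asciifolding."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_synonym_line(line: str, stem: Callable[[str], str] = lambda t: t) -> str:
    """Fold, lowercase, and stem every word of a comma-separated synonym set."""
    terms = [t.strip() for t in line.split(",") if t.strip()]
    normalized = []
    for term in terms:
        # Same order as the analysis chain: fold, then lowercase, then stem.
        words = [stem(ascii_fold(w).lower()) for w in term.split()]
        normalized.append(" ".join(words))
    return ", ".join(normalized)


def toy_stem(token: str) -> str:
    """Toy stand-in for a stemmer: drops one trailing 'a'. Illustration only."""
    return token[:-1] if token.endswith("a") else token


print(normalize_synonym_line("vacuna, vacunación, inmunización", stem=toy_stem))
```

In practice the stem argument would need to produce exactly what the index-time stemmer produces, otherwise the rewritten synonyms would not match the indexed tokens.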

In this blog post about Solr
<http://www.igate.com/iblog/index.php/stemming-and-synonyms-in-apache-solr/>,
the author mentions that one could define a "custom tokenizer that
returns the stemmed form of words from the synonyms file" to get around
this. Is it possible to configure Elasticsearch this way?
