[
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pavel Yaskevich updated CASSANDRA-12078:
----------------------------------------
Issue Type: Bug (was: Improvement)
> [SASI] Move skip_stop_words filter BEFORE stemming
> --------------------------------------------------
>
> Key: CASSANDRA-12078
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
> Project: Cassandra
> Issue Type: Bug
> Components: sasi
> Environment: Cassandra 3.7, Cassandra 3.8
> Reporter: DOAN DuyHai
> Assignee: DOAN DuyHai
> Fix For: 3.7
>
> Attachments: patch.txt
>
>
> Right now, if both stop-word skipping and stemming are enabled, SASI puts the
> stemming filter in the pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
>     FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
>     ...
>     if (options.shouldStemTerms())
>         builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
>     if (options.shouldIgnoreStopTerms())
>         builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
>     return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong
> results: a term that is not a stop word can be stemmed into one and then
> dropped. For example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
> {code}
> Because of stemming, *danse* (*dance* in English) becomes *dans* (the final
> vowel is removed). Then skip_stop_words is applied. Unfortunately, *dans*
> (*in* in English) is a French stop word, so the term is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE
> country='France'}}, and of course the results are wrong.
> Attached is a trivial patch that moves the skip_stop_words filter BEFORE the
> stemming filter.
> /cc [~xedin] [~jrwest] [~beobal]
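
The ordering problem described above can be reproduced outside Cassandra with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the class FilterOrderDemo, the four-word stop list, and the toy stemmer that strips a single trailing "e" are hypothetical stand-ins, not the SASI filter classes or the contents of patch.txt.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical demo class; it does not use any Cassandra or SASI code.
final class FilterOrderDemo {
    // Assumed, tiny French stop-word list for illustration only.
    static final Set<String> STOP_WORDS = Set.of("dans", "le", "la", "de");

    // Toy stemmer (assumption): drops one trailing 'e' from words longer
    // than three characters. A real stemmer is far more elaborate.
    static String stem(String term) {
        if (term.length() > 3 && term.endsWith("e")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Returns an empty Optional when the term is a stop word.
    static Optional<String> skipStopWord(String term) {
        return STOP_WORDS.contains(term) ? Optional.empty() : Optional.of(term);
    }

    // Order described in the report: stem first, then drop stop words.
    static Optional<String> stemThenSkip(String term) {
        return skipStopWord(stem(term));
    }

    // Proposed order: drop stop words first, then stem what survives.
    static Optional<String> skipThenStem(String term) {
        return skipStopWord(term).map(FilterOrderDemo::stem);
    }

    public static void main(String[] args) {
        System.out.println(stemThenSkip("danse")); // Optional.empty
        System.out.println(skipThenStem("danse")); // Optional[dans]
    }
}
```

With the stem-first order the search term disappears entirely, which matches the report's observation that the LIKE restriction is effectively lost. With the stop-word filter first, the term survives and is then stemmed.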