DOAN DuyHai created CASSANDRA-12078:
---------------------------------------
Summary: [SASI] Move skip_stop_words filter BEFORE stemming
Key: CASSANDRA-12078
URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
Project: Cassandra
Issue Type: Improvement
Components: CQL
Environment: Cassandra 3.7, Cassandra 3.8
Reporter: DOAN DuyHai
Assignee: DOAN DuyHai
Attachments: patch.txt
Right now, if both skip_stop_words and stemming are enabled, SASI places the
stemming filter in the pipeline BEFORE skip_stop_words:
{code:java}
private FilterPipelineTask getFilterPipeline()
{
    FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
    ...
    if (options.shouldStemTerms())
        builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
    if (options.shouldIgnoreStopTerms())
        builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
    return builder.build();
}
{code}
The problem is that stemming before removing stop words can yield wrong results.
I have an example:
{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
{code}
*danse* (*dance* in English) is stemmed to *dans* because the final vowel is
removed. The stop-word filter is applied next. Unfortunately *dans* (*in* in
English) is a French stop word, so the search term is removed completely.
In the end the query is equivalent to {{SELECT * FROM music.albums WHERE
country='France'}} and of course the results are wrong.
Attached is a trivial patch to move the skip_stop_words filter BEFORE the
stemming filter.
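
The effect of filter order can be reproduced outside Cassandra with a toy model. This is only a sketch: the class {{FilterOrderSketch}}, its stop-word list and its one-rule stemmer are hypothetical stand-ins, not the SASI classes or the attached patch. It only assumes that a stemmer maps *danse* to *dans* and that *dans* is in the stop-word list.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Toy model of the two filter orders. These are NOT the SASI classes.
final class FilterOrderSketch {
    // Hypothetical, tiny French stop-word list (assumption for illustration).
    static final Set<String> STOP_WORDS = Set.of("dans", "le", "la", "de");

    // Hypothetical stemmer: strips one trailing 'e', so "danse" -> "dans".
    static String stem(String term) {
        return term.length() > 1 && term.endsWith("e")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // An empty Optional means the term was dropped as a stop word.
    static Optional<String> skipStopWords(String term) {
        return STOP_WORDS.contains(term) ? Optional.empty() : Optional.of(term);
    }

    // Order described as current: stem first, then drop stop words.
    static Optional<String> stemThenSkip(String term) {
        return skipStopWords(stem(term.toLowerCase(Locale.ROOT)));
    }

    // Proposed order: drop stop words first, then stem what is left.
    static Optional<String> skipThenStem(String term) {
        return skipStopWords(term.toLowerCase(Locale.ROOT)).map(FilterOrderSketch::stem);
    }

    public static void main(String[] args) {
        System.out.println(stemThenSkip("danse")); // Optional.empty
        System.out.println(skipThenStem("danse")); // Optional[dans]
        System.out.println(skipThenStem("dans"));  // Optional.empty
    }
}
```

With the stop-word check first, *danse* survives and is stemmed afterwards, while a genuine stop word such as *dans* is still dropped.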