[ https://issues.apache.org/jira/browse/CASSANDRA-11130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140686#comment-15140686 ]
Sam Tunnicliffe commented on CASSANDRA-11130:
---------------------------------------------
[~xedin], mostly lgtm.
The new tests added at the end of {{SASIIndexTest::testPrefixSSTableLookup}}
run somewhat counter to the intent of this ticket. The restriction that
PREFIX indexes support EQ iff the analyzer is non-tokenizing is enforced at
the CQL layer by {{(SASI|Column)Index::supports}}, but the test bypasses this
by constructing the command directly. That is fine in itself, but we should
also have a CQL-based test which verifies and illustrates the restriction.
(Also, you duplicated the "key7" -> "Vijay" row, which looks accidental.)
Regarding the changes to CONTAINS mode, there's a bug in the filtering when
LIKE '<term>' is used. When searching the MemIndex, {{getValueForExactKey}} is
used and works as expected, but after a flush the same query returns incorrect
results:
{code}
CREATE KEYSPACE ks WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 1};
CREATE TABLE ks.t1(k int primary key, v text);
CREATE CUSTOM INDEX ON ks.t1(v) USING
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'mode' :
'CONTAINS' };
INSERT INTO ks.t1(k, v) VALUES (0, 'Pavel');
{code}
{code}
cqlsh> SELECT * FROM ks.t1 WHERE v LIKE 'Pav';
k | v
---+---
(0 rows)
# flush
cqlsh> SELECT * FROM ks.t1 WHERE v LIKE 'Pav';
k | v
---+-------
0 | Pavel
(1 rows)
{code}
The fix is trivial though:
{code}
diff --git a/src/java/org/apache/cassandra/index/sasi/plan/Expression.java b/src/java/org/apache/cassandra/index/sasi/plan/Expression.java
index d77b505..0ce102d 100644
--- a/src/java/org/apache/cassandra/index/sasi/plan/Expression.java
+++ b/src/java/org/apache/cassandra/index/sasi/plan/Expression.java
@@ -281,6 +281,9 @@ public class Expression
break;
case MATCH:
+ isMatch = validator.compare(term, requestedValue) == 0;
+ break;
+
case CONTAINS:
isMatch = ByteBufferUtil.contains(term, requestedValue);
break;
{code}
Note that this isn't caught by any current unit test.
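To make the difference between the two branches concrete, here is a minimal standalone sketch. It is not Cassandra code: {{MatchVsContainsSketch}}, its {{Op}} enum, {{bytes}} and the naive {{contains}} helper are hypothetical stand-ins, and {{ByteBuffer.compareTo}} is assumed as a substitute for {{validator.compare}}. It only illustrates why MATCH needs its own exact comparison instead of sharing the CONTAINS check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the MATCH and CONTAINS branches; not Cassandra code.
final class MatchVsContainsSketch
{
    enum Op { MATCH, CONTAINS }

    static ByteBuffer bytes(String s)
    {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Naive substring search over the remaining bytes of both buffers.
    // Uses absolute gets, so neither buffer's position is modified.
    static boolean contains(ByteBuffer haystack, ByteBuffer needle)
    {
        int h = haystack.remaining();
        int n = needle.remaining();
        for (int i = 0; i + n <= h; i++)
        {
            int j = 0;
            while (j < n && haystack.get(haystack.position() + i + j) == needle.get(needle.position() + j))
                j++;
            if (j == n)
                return true;
        }
        return false;
    }

    static boolean isSatisfiedBy(Op op, ByteBuffer term, ByteBuffer requested)
    {
        switch (op)
        {
            case MATCH:
                // Exact comparison (assumed equivalent of validator.compare(...) == 0).
                return term.compareTo(requested) == 0;
            case CONTAINS:
                return contains(term, requested);
            default:
                throw new AssertionError(op);
        }
    }

    public static void main(String[] args)
    {
        ByteBuffer stored = bytes("Pavel");
        System.out.println(isSatisfiedBy(Op.MATCH, stored, bytes("Pav")));    // exact match on a partial term
        System.out.println(isSatisfiedBy(Op.MATCH, stored, bytes("Pavel")));  // exact match on the full term
        System.out.println(isSatisfiedBy(Op.CONTAINS, stored, bytes("Pav"))); // substring match
    }
}
```

A unit test along these lines, run against flushed data rather than the MemIndex, would presumably have caught the LIKE 'Pav' result above.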
> [SASI Pre-QA] = semantics not respected when using StandardAnalyzer
> -------------------------------------------------------------------
>
> Key: CASSANDRA-11130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11130
> Project: Cassandra
> Issue Type: Bug
> Components: CQL
> Environment: Tested from build
> [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
> Reporter: DOAN DuyHai
> Assignee: Pavel Yaskevich
> Fix For: 3.4
>
>
> Tested from build
> [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
> {code:sql}
> CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '1'} AND durable_writes = true;
> CREATE TABLE music.albums (
> id int PRIMARY KEY,
> artist text,
> title1 text,
> title2 text
> );
> CREATE CUSTOM INDEX ON music.albums (title1) USING
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS =
> {'tokenization_skip_stop_words': 'true', 'analyzer_class':
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
> 'case_sensitive': 'false', 'mode': 'PREFIX', 'tokenization_enable_stemming':
> 'true'};
> CREATE CUSTOM INDEX ON music.albums (title2) USING
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS =
> {'tokenization_skip_stop_words': 'true', 'analyzer_class':
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
> 'case_sensitive': 'false', 'mode': 'CONTAINS',
> 'tokenization_enable_stemming': 'true'};
> INSERT INTO music.albums(id, artist, title1, title2)
> VALUES(1, 'Superpitcher', 'Yesterday', 'Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2)
> VALUES(2, 'Hilary Duff', 'So Yesterday', 'So Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2)
> VALUES(3, 'The Mr. T Experience', 'Yesterday Rules', 'Yesterday Rules');
> SELECT artist,title1 FROM music.albums WHERE title1='Yesterday';
> artist | title1
> ------------------------+----------------
> Superpitcher | Yesterday
> Hilary Duff | So Yesterday
> The Mr. T Experience | Yesterday Rules
>
> (3 rows)
> SELECT artist,title1 FROM music.albums WHERE title2='Yesterday';
> artist | title1
> ------------------------+----------------
> Superpitcher | Yesterday
> Hilary Duff | So Yesterday
> The Mr. T Experience | Yesterday Rules
>
> (3 rows)
> {code}
> The semantics of *=* are not respected. SASI should return only the 1 row
> with an exact match. Using *LIKE* would return all 3 rows. This affects both
> *PREFIX* and *CONTAINS* modes. Using *NonTokenizingAnalyzer* returns 1 row
> with an exact match.
> So the semantics of *=* depend on the chosen analyzer, which is
> inconsistent. We should force *=* to be an exact match no matter which
> analyzer is chosen.