[
https://issues.apache.org/jira/browse/CASSANDRA-11122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
DOAN DuyHai updated CASSANDRA-11122:
------------------------------------
Description:
I built the snapshot version taken from here:
https://github.com/xedin/cassandra/tree/CASSANDRA-11067
I created a tiny music dataset containing non-ASCII characters (*Cyrillic*, actually)
and created a SASI index on the artist name.
SASI finds the row with the Cyrillic name but, strangely, fails to find the row with a
plain ASCII name (_'Object'_).
{code:sql}
CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
AND durable_writes = true;

CREATE TABLE music.albums (
    title text PRIMARY KEY,
    artist text
);

INSERT INTO music.albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO music.albums(artist,title) VALUES('Hayden','Mild and Hazy');
INSERT INTO music.albums(artist,title) VALUES('Самое Большое Простое Число','СБПЧ Оркестр');

CREATE CUSTOM INDEX ON music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

SELECT * FROM music.albums;
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden
        СБПЧ Оркестр | Самое Большое Простое Число
(3 rows)

SELECT * FROM music.albums WHERE artist='Самое Большое Простое Число';
 title               | artist
---------------------+-----------------------------
        СБПЧ Оркестр | Самое Большое Простое Число
(1 rows)

SELECT * FROM music.albums WHERE artist='Hayden';
 title               | artist
---------------------+-----------------------------
       Mild and Hazy |                      Hayden
(1 rows)

SELECT * FROM music.albums WHERE artist='Object';
 title               | artist
---------------------+-----------------------------
(0 rows)

SELECT * FROM music.albums WHERE artist LIKE 'Ob%';
 title               | artist
---------------------+-----------------------------
(0 rows)
{code}
Strangely enough, after deleting all the data and re-inserting it without the
Russian artist with the Cyrillic name, SASI does find _'Object'_ ...
{code:sql}
DROP INDEX music.albums_artist_idx;
TRUNCATE TABLE music.albums;

INSERT INTO music.albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO music.albums(artist,title) VALUES('Hayden','Mild and Hazy');

CREATE CUSTOM INDEX ON music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

SELECT * FROM music.albums;
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden
(2 rows)

SELECT * FROM music.albums WHERE artist='Object';
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
(1 rows)

SELECT * FROM music.albums WHERE artist LIKE 'Ob%';
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
(1 rows)
{code}
The behaviour is quite inconsistent. I could understand SASI refusing to
index Cyrillic characters, or throwing an exception when it encounters non-ASCII
characters (because we did not specify a locale), but it is very surprising
that indexing fails for a plain ASCII value like _Object_.
Could it be that SASI indexes the artist names by following the token range order
of table albums (hash of title) and stops indexing after encountering the
Cyrillic name?
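A quick way to probe this hypothesis (a sketch only, not something that was run as part of this report) is to print each row's partition token next to its values. A full-table scan returns partitions in token order, so the output shows where the Cyrillic row sits relative to _'Object'_ and _'Hayden'_. The query assumes only the built-in {{token()}} function and the {{music.albums}} table defined above.
{code:sql}
-- Show the partitioner token of each partition key (title) alongside the row.
-- Rows come back in token order, i.e. the order in which a scan walks the table.
SELECT token(title), title, artist FROM music.albums;
{code}
If the row for _'Object'_ turns out to sort before the Cyrillic row, then "indexing stops after the Cyrillic name" would not be enough on its own to explain the missing term.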
was:
I built the snapshot version taken from here:
https://github.com/xedin/cassandra/tree/CASSANDRA-11067
I created a tiny music dataset containing non-ASCII characters (*Cyrillic*, actually)
and created a SASI index on the artist name.
SASI finds the row with the Cyrillic name but, strangely, fails to find the row with a
plain ASCII name (_'Object'_).
{code:sql}
CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
AND durable_writes = true;

CREATE TABLE music.albums (
    title text PRIMARY KEY,
    artist text
);

INSERT INTO music.albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO music.albums(artist,title) VALUES('Hayden','Mild and Hazy');
INSERT INTO music.albums(artist,title) VALUES('Самое Большое Простое Число','СБПЧ Оркестр');

CREATE CUSTOM INDEX ON music.albums(artist) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

SELECT * FROM music.albums;
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden
        СБПЧ Оркестр | Самое Большое Простое Число
(3 rows)

SELECT * FROM albums WHERE artist='Самое Большое Простое Число';
 title               | artist
---------------------+-----------------------------
        СБПЧ Оркестр | Самое Большое Простое Число
(1 rows)

SELECT * FROM albums WHERE artist='Hayden';
 title               | artist
---------------------+-----------------------------
       Mild and Hazy |                      Hayden
(1 rows)

SELECT * FROM albums WHERE artist='Object';
 title               | artist
---------------------+-----------------------------
(0 rows)

SELECT * FROM albums WHERE artist LIKE 'Ob%';
 title               | artist
---------------------+-----------------------------
(0 rows)
{code}
Strangely enough, after deleting all the data and re-inserting it without the
Russian artist with the Cyrillic name, SASI does find _'Object'_ ...
{code:sql}
DROP INDEX albums_artist_idx;
TRUNCATE TABLE albums;

INSERT INTO albums(artist,title) VALUES('Object','The Reflecting Skin');
INSERT INTO albums(artist,title) VALUES('Hayden','Mild and Hazy');

SELECT * FROM music.albums;
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
       Mild and Hazy |                      Hayden
(2 rows)

SELECT * FROM albums WHERE artist='Object';
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
(1 rows)

SELECT * FROM albums WHERE artist LIKE 'Ob%';
 title               | artist
---------------------+-----------------------------
 The Reflecting Skin |                      Object
(1 rows)
{code}
The behaviour is quite inconsistent. I could understand SASI refusing to
index Cyrillic characters, or throwing an exception when it encounters non-ASCII
characters (because we did not specify a locale), but it is very surprising
that indexing fails for a plain ASCII value like _Object_.
Could it be that SASI indexes the artist names by following the token range order
of table albums (hash of title) and stops indexing after encountering the
Cyrillic name?
> SASI does not find term when indexing non-ascii character
> ---------------------------------------------------------
>
> Key: CASSANDRA-11122
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11122
> Project: Cassandra
> Issue Type: Bug
> Components: CQL
> Environment: Cassandra 3.4 SNAPSHOT
> Reporter: DOAN DuyHai
> Attachments: CASSANDRA-11122.patch