DOAN DuyHai created CASSANDRA-12073:
---------------------------------------

             Summary: [SASI] PREFIX search on CONTAINS/NonTokenizer mode 
returns only partial results
                 Key: CASSANDRA-12073
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
             Project: Cassandra
          Issue Type: Bug
          Components: CQL
         Environment: Cassandra 3.7
            Reporter: DOAN DuyHai


{noformat}
cqlsh:music> CREATE TABLE music.albums (
    id uuid PRIMARY KEY,
    artist text,
    country text,
    quality text,
    status text,
    title text,
    year int
);

cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
'CONTAINS', 'analyzer_class': 
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
'case_sensitive': 'false'};

cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;

 id                                   | artist    | country        | quality | 
status    | title                     | year
--------------------------------------+-----------+----------------+---------+-----------+---------------------------+------
 372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |            USA |  normal |  
Official |           Red and Blue EP | 2006
 1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |        Unknown |  normal | 
Promotion |                Poker Face | 2008
 31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |            USA |  normal |  
Official |   The Cherrytree Sessions | 2009
 8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |        Unknown |  normal |  
Official |                Poker Face | 2009
 98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |            USA |  normal |  
Official |   Just Dance: The Remixes | 2008
 a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |          Italy |  normal |  
Official |                  The Fame | 2008
 849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |            USA |  normal |  
Official |            Christmas Tree | 2008
 4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |      Australia |  normal |  
Official |                     Eh Eh | 2009
 80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |            USA |  normal |  
Official | Just Dance Remixes Part 2 | 2008
 3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |        Unknown |  normal |  
    null |                Just Dance | 2008
 9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal |  
Official |                Just Dance | 2009

(11 rows)
{noformat}

*SASI* says that there are only 11 artists whose name starts with {{lady}}.

However, in the data set, there are:

* Lady Pank
* Lady Saw
* Lady Saw
* Ladyhawke
* Ladytron
* Ladysmith Black Mambazo
* Lady Gaga
* Lady Sovereign

etc ...

By debugging the source code, the issue is in 
{{OnDiskIndex.TermIterator::computeNext()}}

{code:java}
for (;;)
            {
                if (currentBlock == null)
                    return endOfData();

                if (offset >= 0 && offset < currentBlock.termCount())
                {
                    DataTerm currentTerm = currentBlock.getTerm(nextOffset());
                    if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
                        continue;

                    // flip the flag right on the first bounds match
                    // to avoid expensive comparisons
                    checkLower = false;

                    if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
                        return endOfData();
                    return currentTerm;
                }
                nextBlock();
            }
{code}

 So the {{endOfData()}} conditions are:

* currentBlock == null
* checkUpper && !e.isUpperSatisfiedBy(currentTerm)

The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether the 
term match but also returns *false* when it's a *partial term* !

{code:java}
    public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
    {
        if (!hasUpper())
            return true;

        if (nonMatchingPartial(term))
            return false;

        int cmp = term.compareTo(validator, upper.value, false);
        return cmp < 0 || cmp == 0 && upper.inclusive;
    }
{code}

By debugging the OnDiskIndex data, I've found:

{noformat}
...
Data Term (partial ? false) : lady gaga. 0x0, TokenTree offset : 21120
Data Term (partial ? true) : lady of bells. 0x0, TokenTree offset : 21360
Data Term (partial ? false) : lady pank. 0x0, TokenTree offset : 21440
...
{noformat}

*lady gaga* did match but then *lady of bells* fails because it is a partial 
match. Then SASI returns {{endOfData()}} instead continuing to check *lady pank*

One quick & dirty fix I've found is to add 
{{!e.nonMatchingPartial(currentTerm)}} to the *if* condition

{code:java}
 if (checkUpper && !e.isUpperSatisfiedBy(currentTerm) && 
!e.nonMatchingPartial(currentTerm))
                        return endOfData();
{code} 

I feel it kinda dirty and I suspect this bug to has more impact than the 
*PREFIX* case

Wdyt [~xedin] [~beobal] [~jrwest] ? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to