[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12073:
------------------------------------
    Attachment: patch_PREFIX_search_with_CONTAINS_mode.txt

Patch attached

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12073
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Cassandra 3.7
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>             Fix For: 3.8
>
>         Attachments: patch_PREFIX_search_with_CONTAINS_mode.txt
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
>     id uuid PRIMARY KEY,
>     artist text,
>     country text,
>     quality text,
>     status text,
>     title text,
>     year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id                                   | artist    | country        | quality 
> | status    | title                     | year
> --------------------------------------+-----------+----------------+---------+-----------+---------------------------+------
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |            USA |  normal 
> |  Official |           Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |        Unknown |  normal 
> | Promotion |                Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |            USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |        Unknown |  normal 
> |  Official |                Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |            USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |          Italy |  normal 
> |  Official |                  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |            USA |  normal 
> |  Official |            Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |      Australia |  normal 
> |  Official |                     Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |            USA |  normal 
> |  Official | Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |        Unknown |  normal 
> |      null |                Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal 
> |  Official |                Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in 
> {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
>             {
>                 if (currentBlock == null)
>                     return endOfData();
>                 if (offset >= 0 && offset < currentBlock.termCount())
>                 {
>                     DataTerm currentTerm = currentBlock.getTerm(nextOffset());
>                     if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
>                         continue;
>                     // flip the flag right on the first bounds match
>                     // to avoid expensive comparisons
>                     checkLower = false;
>                     if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
>                         return endOfData();
>                     return currentTerm;
>                 }
>                 nextBlock();
>             }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether 
> the term match but also returns *false* when it's a *partial term* !
> {code:java}
>     public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
>     {
>         if (!hasUpper())
>             return true;
>         if (nonMatchingPartial(term))
>             return false;
>         int cmp = term.compareTo(validator, upper.value, false);
>         return cmp < 0 || cmp == 0 && upper.inclusive;
>     }
> {code}
> By debugging the OnDiskIndex data, I've found:
> {noformat}
> ...
> Data Term (partial ? false) : lady gaga. 0x0, TokenTree offset : 21120
> Data Term (partial ? true) : lady of bells. 0x0, TokenTree offset : 21360
> Data Term (partial ? false) : lady pank. 0x0, TokenTree offset : 21440
> ...
> {noformat}
> *lady gaga* did match but then *lady of bells* fails because it is a partial 
> match. Then SASI returns {{endOfData()}} instead continuing to check *lady 
> pank*
> One quick & dirty fix I've found is to add 
> {{!e.nonMatchingPartial(currentTerm)}} to the *if* condition
> {code:java}
>  if (checkUpper && !e.isUpperSatisfiedBy(currentTerm) && 
> !e.nonMatchingPartial(currentTerm))
>                         return endOfData();
> {code} 
> I feel it kinda dirty and I suspect this bug to has more impact than the 
> *PREFIX* case
> Wdyt [~xedin] [~beobal] [~jrwest] ? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to