[jira] [Comment Edited] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-26 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350113#comment-15350113
 ] 

DOAN DuyHai edited comment on CASSANDRA-12073 at 6/26/16 1:32 PM:
--

Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped incorrectly

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{e::isLowerSatisfiedBy}} and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
return endOfData();

return currentTerm;
}
{code}

I'll also add an extra unit-test to cover the premature {{endOfData()}} case. 
To test the incorrect next block skipping we'll need to generate enough data so 
that the last term of next block is partial. I'm not sure how feasible it is.

What do you think @xedin, [~jrwest] ?


was (Author: doanduyhai):
Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped 

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{e::isLowerSatisfiedBy}} and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
   

[jira] [Comment Edited] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-26 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350113#comment-15350113
 ] 

DOAN DuyHai edited comment on CASSANDRA-12073 at 6/26/16 1:33 PM:
--

Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped incorrectly

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{Expression::isLowerSatisfiedBy}} and {{Expression::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
return endOfData();

return currentTerm;
}
{code}

I'll also add an extra unit-test to cover the premature {{endOfData()}} case. 
To test the incorrect next block skipping we'll need to generate enough data so 
that the last term of next block is partial. I'm not sure how feasible it is.

What do you think @xedin, [~jrwest] ?


was (Author: doanduyhai):
Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped incorrectly

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{e::isLowerSatisfiedBy}} and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && 

[jira] [Comment Edited] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-23 Thread Jordan West (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347099#comment-15347099
 ] 

Jordan West edited comment on CASSANDRA-12073 at 6/23/16 8:20 PM:
--

[~doanduyhai] the fix you mentioned looks correct. it should only impact the 
PREFIX type because of the implementation of `nonMatchingPartial`. 
Unfortunately this means we may do a bit more work in the index than desired 
but this is an acceptable consequence of the more complex query. 


was (Author: jrwest):
[~doanduyhai] the fix you mentioned looks correct. it should only impact the 
[code]PREFIX[/code] type because of the implementation of `nonMatchingPartial`. 
Unfortunately this means we may do a bit more work in the index than desired 
but this is an acceptable consequence of the more complex query. 

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> ---
>
> Key: CASSANDRA-12073
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
> id uuid PRIMARY KEY,
> artist text,
> country text,
> quality text,
> status text,
> title text,
> year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id   | artist| country| quality 
> | status| title | year
> --+---++-+---+---+--
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |USA |  normal 
> |  Official |   Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |Unknown |  normal 
> | Promotion |Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |Unknown |  normal 
> |  Official |Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |  Italy |  normal 
> |  Official |  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |USA |  normal 
> |  Official |Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |  Australia |  normal 
> |  Official | Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |USA |  normal 
> |  Official | Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |Unknown |  normal 
> |  null |Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal 
> |  Official |Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in 
> {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
> {
> if (currentBlock == null)
> return endOfData();
> if (offset >= 0 && offset < currentBlock.termCount())
> {
> DataTerm currentTerm = currentBlock.getTerm(nextOffset());
> if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
> continue;
> // flip the flag right on the first bounds match
> // to avoid expensive comparisons
> checkLower = false;
> if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
> return endOfData();
> return currentTerm;
> }
> nextBlock();
> }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether 
> the