[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Sam Tunnicliffe (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113677#comment-15113677
 ] 

Sam Tunnicliffe commented on CASSANDRA-10661:
-

[~xedin] SGTM!

> Integrate SASI to Cassandra
> ---
>
> Key: CASSANDRA-10661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Pavel Yaskevich
>Assignee: Pavel Yaskevich
>  Labels: sasi
> Fix For: 3.x
>
>
> We have recently released a new secondary index engine 
> (https://github.com/xedin/sasi) built using the SecondaryIndex API. There are 
> still a couple of things to work out regarding 3.x, since it's currently 
> targeted at the 2.0 release. I want to make this an umbrella issue for all of 
> the things related to the integration of SASI, which are also tracked in 
> [sasi_issues|https://github.com/xedin/sasi/issues], into the mainline Cassandra 
> 3.x release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8028) Unable to compute when histogram overflowed

2016-01-23 Thread Navjyot Nishant (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113715#comment-15113715
 ] 

Navjyot Nishant commented on CASSANDRA-8028:


Hi All,

We are getting a similar issue while autocompaction is running on a few of our 
nodes. The following error is being logged; can someone please suggest what is 
causing it and how to resolve it? We are using Cassandra 2.1.9. Please let me 
know if further information is required.

Error:

ERROR [CompactionExecutor:3] 2016-01-23 11:54:50,198 CassandraDaemon.java:223 - Exception in thread Thread[CompactionExecutor:3,1,main]
java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
	at org.apache.cassandra.utils.EstimatedHistogram.mean(EstimatedHistogram.java:203) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.metadata.StatsMetadata.getEstimatedDroppableTombstoneRatio(StatsMetadata.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.SSTableReader.getEstimatedDroppableTombstoneRatio(SSTableReader.java:1987) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.worthDroppingTombstones(AbstractCompactionStrategy.java:370) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundSSTables(SizeTieredCompactionStrategy.java:96) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundTask(SizeTieredCompactionStrategy.java:179) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
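(Editorial note on the mechanism: this is an illustrative Python sketch, not Cassandra's actual Java code. Cassandra's EstimatedHistogram puts values into exponentially growing buckets; a value larger than the last bucket boundary lands in an overflow bucket, after which mean/max can no longer be estimated and the IllegalStateException above is thrown. The growth factor and bucket count below are approximations of the real implementation.)

```python
# Sketch of Cassandra's EstimatedHistogram overflow behavior (illustrative only).
# Bucket offsets grow roughly 20% per bucket; values beyond the largest offset
# fall into an "overflow" slot, making mean/max uncomputable.

def make_offsets(n=90):
    offsets, last = [], 1
    for _ in range(n):
        offsets.append(last)
        last = max(last + 1, int(last * 1.2))
    return offsets

class EstimatedHistogram:
    def __init__(self, n=90):
        self.offsets = make_offsets(n)
        self.buckets = [0] * (n + 1)  # final slot is the overflow bucket

    def add(self, value):
        # place the value in the first bucket whose offset covers it
        for i, off in enumerate(self.offsets):
            if value <= off:
                self.buckets[i] += 1
                return
        self.buckets[-1] += 1  # too large for any bucket: overflow

    def is_overflowed(self):
        return self.buckets[-1] > 0

    def mean(self):
        # mirrors the failure mode in EstimatedHistogram.mean()
        if self.is_overflowed():
            raise ValueError(
                "Unable to compute ceiling for max when histogram overflowed")
        total = sum(self.buckets)
        weighted = sum(c * off for c, off in zip(self.buckets, self.offsets))
        return weighted // total if total else 0

h = EstimatedHistogram()
h.add(1000)          # a normal-sized partition fits in a bucket
h.add(30753941057)   # a ~30 GB partition (as in the cfstats below) overflows
assert h.is_overflowed()
```

This matches the report: the table's "Compacted partition maximum bytes" is far beyond any bucket boundary, so any code path calling mean() on that histogram fails.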


> Unable to compute when histogram overflowed
> ---
>
> Key: CASSANDRA-8028
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8028
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
> Environment: Linux
>Reporter: Gianluca Borello
>Assignee: Carl Yeksigian
> Fix For: 2.1.3
>
> Attachments: 8028-2.1-clean.txt, 8028-2.1-v2.txt, 8028-2.1.txt, 
> 8028-trunk.txt, sstable-histogrambuster.tar.bz2
>
>
> It seems like with 2.1.0 histograms can't be computed most of the time:
> $ nodetool cfhistograms draios top_files_by_agent1
> nodetool: Unable to compute when histogram overflowed
> See 'nodetool help' or 'nodetool help '.
> I can probably find a way to attach a .cql script to reproduce it, but I 
> suspect it must be easy to replicate, as it happens on more than 50% of 
> my column families.





[jira] [Updated] (CASSANDRA-11063) Unable to compute ceiling for max when histogram overflowed

2016-01-23 Thread Navjyot Nishant (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navjyot Nishant updated CASSANDRA-11063:

Description: 
Issue https://issues.apache.org/jira/browse/CASSANDRA-8028 seems related to the 
error we are getting, but we are seeing it with Cassandra 2.1.9: while 
autocompaction is running, it keeps throwing the following errors. We are unsure 
whether this is a bug or can be resolved on our side; please advise.

WARN  [CompactionExecutor:3] 2016-01-23 13:30:40,907 SSTableWriter.java:240 - Compacting large partition gccatlgsvcks/category_name_dedup:66611300 (138152195 bytes)
ERROR [CompactionExecutor:1] 2016-01-23 13:30:50,267 CassandraDaemon.java:223 - Exception in thread Thread[CompactionExecutor:1,1,main]
java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
	at org.apache.cassandra.utils.EstimatedHistogram.mean(EstimatedHistogram.java:203) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.metadata.StatsMetadata.getEstimatedDroppableTombstoneRatio(StatsMetadata.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.SSTableReader.getEstimatedDroppableTombstoneRatio(SSTableReader.java:1987) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.worthDroppingTombstones(AbstractCompactionStrategy.java:370) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundSSTables(SizeTieredCompactionStrategy.java:96) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundTask(SizeTieredCompactionStrategy.java:179) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]


Additional info:
cfstats is running fine for that table...

~ $ nodetool cfstats gccatlgsvcks.category_name_dedup
Keyspace: gccatlgsvcks
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: category_name_dedup
SSTable count: 6
Space used (live): 836089073
Space used (total): 836089073
Space used by snapshots (total): 3621519
Off heap memory used (total): 6925736
SSTable Compression Ratio: 0.03725398763856016
Number of keys (estimate): 3004
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.0
Bloom filter space used: 5240
Bloom filter off heap memory used: 5192
Index summary off heap memory used: 1200
Compression metadata off heap memory used: 6919344
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 30753941057
Compacted partition mean bytes: 8352388
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0


  was:
Issue https://issues.apache.org/jira/browse/CASSANDRA-8028 seems related with 
error we are getting. But we are getting this with Cassandra 2.1.9 when 
autocompaction is running it keeps throwing following errors, we are unsure if 
its a bug or can be resolved, please suggest.

ERROR [CompactionExecutor:3] 2016-01-23 11:52:50,197 CassandraDaemon.java:223 - 
Exception in thread Thread[CompactionExecutor:3,1,main]
java.lang.IllegalStateException: Unable to compute ceiling for max when 
histogram overflowed
   

[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113690#comment-15113690
 ] 

Pavel Yaskevich commented on CASSANDRA-10661:
-

[~beobal] Awesome, will try to do everything tomorrow, thanks!



[jira] [Comment Edited] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112994#comment-15112994
 ] 

Pavel Yaskevich edited comment on CASSANDRA-10661 at 1/23/16 11:04 AM:
---

[~beobal] How about `unfilteredCluster`? Since we are on the same page about 
this, here is what I'm thinking - we are going to update README.md we have in 
xedin/sasi and I'm going to put it into doc/SASI.md, squash all 17 commits into 
one and push to trunk, sounds good?


was (Author: xedin):
[~beobal] How about `unfilteredCluster`? Since we are on the same page about 
this, here is what I'm thinking - we are going to avoid README.md we have in 
xedin/sasi and I'm going to put it into doc/SASI.md, squash all 17 commits into 
one and push to trunk, sounds good?



[jira] [Commented] (CASSANDRA-8028) Unable to compute when histogram overflowed

2016-01-23 Thread Navjyot Nishant (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113748#comment-15113748
 ] 

Navjyot Nishant commented on CASSANDRA-8028:


I have created https://issues.apache.org/jira/browse/CASSANDRA-11063 to track 
this issue.



[jira] [Created] (CASSANDRA-11063) Unable to compute ceiling for max when histogram overflowed

2016-01-23 Thread Navjyot Nishant (JIRA)
Navjyot Nishant created CASSANDRA-11063:
---

 Summary: Unable to compute ceiling for max when histogram 
overflowed
 Key: CASSANDRA-11063
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11063
 Project: Cassandra
  Issue Type: Bug
  Components: Compaction
 Environment: Cassandra 2.1.9 on RHEL
Reporter: Navjyot Nishant


Issue https://issues.apache.org/jira/browse/CASSANDRA-8028 seems related to the 
error we are getting, but we are seeing it with Cassandra 2.1.9: while 
autocompaction is running, it keeps throwing the following errors. We are unsure 
whether this is a bug or can be resolved on our side; please advise.

ERROR [CompactionExecutor:3] 2016-01-23 11:52:50,197 CassandraDaemon.java:223 - Exception in thread Thread[CompactionExecutor:3,1,main]
java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
	at org.apache.cassandra.utils.EstimatedHistogram.mean(EstimatedHistogram.java:203) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.metadata.StatsMetadata.getEstimatedDroppableTombstoneRatio(StatsMetadata.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.SSTableReader.getEstimatedDroppableTombstoneRatio(SSTableReader.java:1987) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.worthDroppingTombstones(AbstractCompactionStrategy.java:370) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundSSTables(SizeTieredCompactionStrategy.java:96) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundTask(SizeTieredCompactionStrategy.java:179) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]






[jira] [Updated] (CASSANDRA-11063) Unable to compute ceiling for max when histogram overflowed

2016-01-23 Thread Navjyot Nishant (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navjyot Nishant updated CASSANDRA-11063:

Description: 
Issue https://issues.apache.org/jira/browse/CASSANDRA-8028 seems related to the 
error we are getting, but we are seeing it with Cassandra 2.1.9: while 
autocompaction is running, it keeps throwing the following errors. We are unsure 
whether this is a bug or can be resolved on our side; please advise.

WARN  [CompactionExecutor:3] 2016-01-23 13:30:40,907 SSTableWriter.java:240 - Compacting large partition gccatlgsvcks/category_name_dedup:66611300 (138152195 bytes)
ERROR [CompactionExecutor:1] 2016-01-23 13:30:50,267 CassandraDaemon.java:223 - Exception in thread Thread[CompactionExecutor:1,1,main]
java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
	at org.apache.cassandra.utils.EstimatedHistogram.mean(EstimatedHistogram.java:203) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.metadata.StatsMetadata.getEstimatedDroppableTombstoneRatio(StatsMetadata.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.io.sstable.SSTableReader.getEstimatedDroppableTombstoneRatio(SSTableReader.java:1987) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.worthDroppingTombstones(AbstractCompactionStrategy.java:370) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundSSTables(SizeTieredCompactionStrategy.java:96) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy.getNextBackgroundTask(SizeTieredCompactionStrategy.java:179) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230) ~[apache-cassandra-2.1.9.jar:2.1.9]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]


Additional info:

cfstats is running fine for that table...

~ $ nodetool cfstats gccatlgsvcks.category_name_dedup
Keyspace: gccatlgsvcks
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: category_name_dedup
SSTable count: 6
Space used (live): 836314727
Space used (total): 836314727
Space used by snapshots (total): 3621519
Off heap memory used (total): 6930368
SSTable Compression Ratio: 0.03725358753117693
Number of keys (estimate): 3004
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.0
Bloom filter space used: 5240
Bloom filter off heap memory used: 5192
Index summary off heap memory used: 1200
Compression metadata off heap memory used: 6923976
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 30753941057
Compacted partition mean bytes: 8352388
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0


cfhistograms is also running fine...

~ $ nodetool cfhistograms gccatlgsvcks category_name_dedup
gccatlgsvcks/category_name_dedup histograms
Percentile  SSTables   Write Latency   Read Latency   Partition Size   Cell Count
                            (micros)       (micros)          (bytes)
50%             0.00            0.00           0.00             1109           20
75%             0.00            0.00           0.00             2299           42
95%             0.00

[jira] [Comment Edited] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113836#comment-15113836
 ] 

Jack Krupansky edited comment on CASSANDRA-10661 at 1/23/16 6:57 PM:
-

Is there also a way to query a SASI-indexed column by exact value? I mean, it 
seems as if by enabling prefix or contains, that it will always query by prefix 
or contains. For example, if I want to query for full first name, like where 
their full first name really is "J" and not get "John" and "James" as well, 
while at other times I am indeed looking for names starting with a prefix of 
"Jo" for "John", "Joseph", etc.

Or, can I indeed have two indexes on a single column, one a traditional exact 
match, and one a prefix match. Hmmm... in which case, which gets used if I just 
specify a column name?

CREATE INDEX first_name_full ON mytable (first_name)...
CREATE CUSTOM INDEX first_name_prefix ON mytable (first_name)...

(I may be confused here - can you specify an index name in place of a column 
name in a relation in a SELECT/WHERE clause (SELECT... WHERE... 
first_name_exact = 'Joe')? I don't see any doc/spec that indicates that you 
can. I'm not sure why I thought that you could. But I don't see any code that 
detects and fails on this case at CREATE INDEX time. The code checks for 
"everything but name" rather than detecting two non-keys/values indexes on the 
same column.)

It would be good to have an example that illustrates this. In fact, I would 
argue that first and last names are perfect examples of where you really do 
need to query on both exact match and partial match. In fact, I'm not sure I 
can think of any examples of non-tokenized text fields where you don't want to 
reserve the ability to find an exact match even if you do need partial matches 
for some queries.

Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

Are there any use cases of traditional Cassandra indexes which shouldn't almost 
automatically be converted to SPARSE. After all, the current recommended best 
practice is to avoid secondary indexes where the column cardinality is either 
very high or very low, which seems to be a match for SPARSE, although the 
precise meaning of SPARSE is still a bit fuzzy for me.

Maybe, for the first_name use case I mentioned the user would be better off 
with a first_name Materialized View using first_name in the PK instead of the 
SPARSE SASI index. In fact, by placing first_name in the partition key of the 
MV I could assure that all base table rows with the same first name would be on 
the same node.
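For illustration, the materialized-view alternative described above might look roughly like this (hypothetical table and key layout; `mytable` and `first_name` are from the discussion, the `id uuid` key is an assumption; standard Cassandra 3.0+ MV syntax):

```sql
-- Hypothetical base table keyed by id
CREATE TABLE mytable (
    id uuid PRIMARY KEY,
    first_name text,
    last_name text
);

-- MV partitioned by first_name, so all rows sharing a first name
-- land on the same node
CREATE MATERIALIZED VIEW mytable_by_first_name AS
    SELECT * FROM mytable
    WHERE first_name IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (first_name, id);

-- Exact-match lookup served by the view:
SELECT * FROM mytable_by_first_name WHERE first_name = 'J';
```

This gives exact-match semantics only; prefix queries would still need a SASI index on the base table.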

If all of that is true, we will need to give users some decent guidance on when 
to use SPARSE SASI vs. MV (vs. classic secondary... or even DSE Search.)


was (Author: jkrupan):
Is there also a way to query a SASI-indexed column by exact value? I mean, it 
seems as if by enabling prefix or contains, that it will always query by prefix 
or contains. For example, if I want to query for full first name, like where 
their full first name really is "J" and not get "John" and "James" as well, 
while at other times I am indeed looking for names starting with a prefix of 
"Jo" for "John", "Joseph", etc.

Or, can I indeed have two indexes on a single column, one a traditional exact 
match, and one a prefix match. Hmmm... in which case, which gets used if I just 
specify a column name?

CREATE INDEX first_name_full ON mytable (first_name)...
CREATE CUSTOM INDEX first_name_prefix ON mytable (first_name)...

It would be good to have an example that illustrates this. In fact, I would 
argue that first and last names are perfect examples of where you really do 
need to query on both exact match and partial match. In fact, I'm not sure I 
can think of any examples of non-tokenized text fields where you don't want to 
reserve the ability to find an exact match even if you do need partial matches 
for some queries.

Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

Are there any use cases of traditional Cassandra indexes which shouldn't almost 
automatically be converted to SPARSE. After all, the current recommended best 
practice is to avoid secondary indexes where the column cardinality is either 
very high or very low, which seems to be a match for SPARSE, although the 
precise meaning of SPARSE is still a bit fuzzy for me.

Maybe, for the first_name use case I mentioned the user would be better off 
with a first_name Materialized View using first_name in the PK instead of the 
SPARSE SASI index. In fact, by placing first_name in the partition key of the 
MV 

[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jordan West (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113918#comment-15113918
 ] 

Jordan West commented on CASSANDRA-10661:
-

bq. Is there also a way to query a SASI-indexed column by exact value? I mean, 
it seems as if by enabling prefix or contains, that it will always query by 
prefix or contains. For example, if I want to query for full first name, like 
where their full first name really is "J" and not get "John" and "James" as 
well, while at other times I am indeed looking for names starting with a prefix 
of "Jo" for "John", "Joseph", etc.

The example is correct, but this is not a limitation of SASI; it's a limitation 
in CQL, and we decided not to further extend the grammar, since we have already 
had to scale back our grammar changes to later phases (removing OR, grouping, 
and != support for now). Ideally, CQL would support a `LIKE` operator similar 
to SQL, and depending on whether the index was created with `PREFIX` or 
`CONTAINS` we would allow/disallow forms such as `%Jo%` or `_j%`. 

bq. Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

It does, but so are all queries on numerical data, which thinking about it, may 
make the `PREFIX` option confusing for numeric types. SPARSE is intended to 
improve query performance on numerical data where there are a large number of 
terms (e.g. timestamps), but small number of keys per term (e.g. some 
timeseries data).  `SPARSE` should not be used on every numerical column, and 
for most non-numerical data is not an ideal setting either. For example, in a 
large data set of first names the number of names will be small compared to the 
number of keys, and given the distribution of first names using SPARSE will 
increase the size of the index and at best have zero effect on query 
performance, but may hurt it.
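As a sketch only, the hypothetical `LIKE` forms discussed above might look like the following (this syntax did not exist in CQL at the time of this thread; the index class name is the one SASI later used in trunk, and table/column names are made up):

```sql
-- PREFIX-mode SASI index: leading-anchored patterns only
CREATE CUSTOM INDEX first_name_prefix ON mytable (first_name)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = { 'mode': 'PREFIX' };

SELECT * FROM mytable WHERE first_name LIKE 'Jo%';   -- allowed under PREFIX
-- SELECT * FROM mytable WHERE first_name LIKE '%Jo%';  -- would need CONTAINS mode
```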





 



[jira] [Comment Edited] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jordan West (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113918#comment-15113918
 ] 

Jordan West edited comment on CASSANDRA-10661 at 1/23/16 7:42 PM:
--

bq. Is there also a way to query a SASI-indexed column by exact value? I mean, 
it seems as if by enabling prefix or contains, that it will always query by 
prefix or contains. For example, if I want to query for full first name, like 
where their full first name really is "J" and not get "John" and "James" as 
well, while at other times I am indeed looking for names starting with a prefix 
of "Jo" for "John", "Joseph", etc.

The example is correct, but this is not a limitation of SASI; it's a limitation 
in CQL, and we decided not to further extend the grammar, since we have already 
had to scale back our grammar changes to later phases (removing OR, grouping, 
and != support for now). Ideally, `=` would mean exact match and CQL would 
support a `LIKE` operator similar to SQL, and depending on whether the index was 
created with `PREFIX` or `CONTAINS` we would allow/disallow forms such as 
`%Jo%` or `_j%`. 

bq. Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

It does, but so are all queries on numerical data, which thinking about it, may 
make the `PREFIX` option confusing for numeric types. SPARSE is intended to 
improve query performance on numerical data where there are a large number of 
terms (e.g. timestamps), but small number of keys per term (e.g. some 
timeseries data).  `SPARSE` should not be used on every numerical column, and 
for most non-numerical data is not an ideal setting either. For example, in a 
large data set of first names the number of names will be small compared to the 
number of keys, and given the distribution of first names using SPARSE will 
increase the size of the index and at best have zero effect on query 
performance, but may hurt it.





 


was (Author: jrwest):
bq. Is there also a way to query a SASI-indexed column by exact value? I mean, 
it seems as if by enabling prefix or contains, that it will always query by 
prefix or contains. For example, if I want to query for full first name, like 
where their full first name really is "J" and not get "John" and "James" as 
well, while at other times I am indeed looking for names starting with a prefix 
of "Jo" for "John", "Joseph", etc.

The example is correct, but this is not a limitation of SASI, its a limitation 
in CQL, and we decided not to further extend the grammar, since we have already 
had to scale back our grammar changes to later phases (removing OR, grouping, 
and != support for now). Ideally, CQL would support a `LIKE` operator similar 
to SQL, and depending on if the index was created with `PREFIX` or `CONTAINS` 
we would allow/disallow forms such as `%Jo%` or `_j%`. 

bq. Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

It does, but so are all queries on numerical data, which thinking about it, may 
make the `PREFIX` option confusing for numeric types. SPARSE is intended to 
improve query performance on numerical data where there are a large number of 
terms (e.g. timestamps), but small number of keys per term (e.g. some 
timeseries data).  `SPARSE` should not be used on every numerical column, and 
for most non-numerical data is not an ideal setting either. For example, in a 
large data set of first names the number of names will be small compared to the 
number of keys, and given the distribution of first names using SPARSE will 
increase the size of the index and at best have zero effect on query 
performance, but may hurt it.





 

> Integrate SASI to Cassandra
> ---
>
> Key: CASSANDRA-10661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Pavel Yaskevich
>Assignee: Pavel Yaskevich
>  Labels: sasi
> Fix For: 3.x
>
>
> We have recently released a new secondary index engine 
> (https://github.com/xedin/sasi) built using the SecondaryIndex API; there are 
> still a couple of things to work out regarding 3.x since it's currently 
> targeted at the 2.0 release. I want to make this an umbrella issue for all of 
> the work related to integrating SASI into the mainline Cassandra 3.x release, 
> which is also tracked in [sasi_issues|https://github.com/xedin/sasi/issues].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113816#comment-15113816
 ] 

Jack Krupansky commented on CASSANDRA-10661:


So is this stuff actually ready to release? I mean, consistent with the new 
philosophy that "trunk is always releasable"? IOW, if it does get committed, it 
will be in 3.4 no matter what? I only ask because it seemed that there was 
stuff in flux fairly recently (a couple of days ago), suggesting it wasn't 
quite baked enough to be considered "releasable". 



[jira] [Updated] (CASSANDRA-11060) Allow DTCS old SSTable filtering to use min timestamp instead of max

2016-01-23 Thread Wei Deng (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Deng updated CASSANDRA-11060:
-
Labels: dtcs  (was: )

> Allow DTCS old SSTable filtering to use min timestamp instead of max
> 
>
> Key: CASSANDRA-11060
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11060
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Sam Bisbee
>  Labels: dtcs
>
> We have observed a DTCS behavior when using TTLs where SSTables are never or 
> very rarely fully expired due to compaction, allowing expired data to be 
> "stuck" in large partially expired SSTables.
> This is because compaction filtering is performed on the max timestamp, which 
> continues to grow as SSTables are compacted together. This means they will 
> never move past max_sstable_age_days. With a sufficiently large TTL, like 30 
> days, this allows old but not expired SSTables to continue combining and 
> never become fully expired, even with a max_sstable_age_days of 1.
> As a result we have seen expired data hang around in large SSTables for over 
> six months longer than it should have. This is obviously wasteful and a disk 
> capacity issue.
> As a result we have been running an extended version of DTCS called MTCS in 
> some deployments. The only change is that it uses min timestamp instead of 
> max for compaction filtering (filterOldSSTables()). This allows SSTables to 
> move beyond max_sstable_age_days and stop compacting, which means the entire 
> SSTable can become fully expired and be dropped off disk as intended.
> You can see and test MTCS here: https://github.com/threatstack/mtcs
> I am not advocating that MTCS become its own standalone compaction strategy. 
> However, I would like to see a configuration option for DTCS that allows you 
> to specify whether old SSTables should be filtered on min or max timestamp.
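To make the proposal concrete, a sketch of what such a toggle could look like as a compaction subproperty (the `events` table is hypothetical, and `filter_on_min_timestamp` is the proposed option, not an existing DTCS setting):

```sql
ALTER TABLE events WITH compaction = {
  'class': 'DateTieredCompactionStrategy',
  'max_sstable_age_days': '1',
  -- proposed, hypothetical option: age SSTables out of the compaction
  -- window by min timestamp so partially expired tables can fully expire
  'filter_on_min_timestamp': 'true'
};
```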



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113836#comment-15113836
 ] 

Jack Krupansky commented on CASSANDRA-10661:


Is there also a way to query a SASI-indexed column by exact value? I mean, it 
seems as if by enabling prefix or contains, that it will always query by prefix 
or contains. For example, if I want to query for full first name, like where 
their full first name really is "J" and not get "John" and "James" as well, 
while at other times I am indeed looking for names starting with a prefix of 
"Jo" for "John", "Joseph", etc.

Or, can I indeed have two indexes on a single column, one a traditional exact 
match, and one a prefix match. Hmmm... in which case, which gets used if I just 
specify a column name?

CREATE INDEX first_name_full ON table 
CREATE CUSTOM INDEX first_name_prefix ...

It would be good to have an example that illustrates this. In fact, I would 
argue that first and last names are perfect examples of where you really do 
need to query on both exact match and partial match. In fact, I'm not sure I 
can think of any examples of non-tokenized text fields where you don't want to 
reserve the ability to find an exact match even if you do need partial matches 
for some queries.

Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

Are there any use cases of traditional Cassandra indexes which shouldn't almost 
automatically be converted to SPARSE? After all, the current recommended best 
practice is to avoid secondary indexes where the column cardinality is either 
very high or very low, which seems to be a match for SPARSE, although the 
precise meaning of SPARSE is still a bit fuzzy for me.



[jira] [Comment Edited] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113836#comment-15113836
 ] 

Jack Krupansky edited comment on CASSANDRA-10661 at 1/23/16 5:58 PM:
-

Is there also a way to query a SASI-indexed column by exact value? I mean, it 
seems as if by enabling prefix or contains, that it will always query by prefix 
or contains. For example, if I want to query for full first name, like where 
their full first name really is "J" and not get "John" and "James" as well, 
while at other times I am indeed looking for names starting with a prefix of 
"Jo" for "John", "Joseph", etc.

Or, can I indeed have two indexes on a single column, one a traditional exact 
match, and one a prefix match. Hmmm... in which case, which gets used if I just 
specify a column name?

CREATE INDEX first_name_full ON mytable (first_name)...
CREATE CUSTOM INDEX first_name_prefix ON mytable (first_name)...

It would be good to have an example that illustrates this. In fact, I would 
argue that first and last names are perfect examples of where you really do 
need to query on both exact match and partial match. In fact, I'm not sure I 
can think of any examples of non-tokenized text fields where you don't want to 
reserve the ability to find an exact match even if you do need partial matches 
for some queries.

Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

Are there any use cases of traditional Cassandra indexes which shouldn't almost 
automatically be converted to SPARSE? After all, the current recommended best 
practice is to avoid secondary indexes where the column cardinality is either 
very high or very low, which seems to be a match for SPARSE, although the 
precise meaning of SPARSE is still a bit fuzzy for me.

Maybe, for the first_name use case I mentioned, the user would be better off 
with a first_name materialized view using first_name in the PK instead of the 
SPARSE SASI index. In fact, by placing first_name in the partition key of the 
MV I could ensure that all base table rows with the same first name would be 
on the same node.

If all of that is true, we will need to give users some decent guidance on when 
to use SPARSE SASI vs. MV (vs. classic secondary... or even DSE Search.)
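For reference, a sketch of the materialized-view alternative described above (the `users` base table and its columns are hypothetical; the `IS NOT NULL` clauses and the inclusion of every base primary-key column are required by MV rules):

```sql
-- Hypothetical base table:
--   CREATE TABLE users (user_id uuid PRIMARY KEY,
--                       first_name text, last_name text);

-- View partitioned by first_name, co-locating all rows that share
-- a first name on the same replicas.
CREATE MATERIALIZED VIEW users_by_first_name AS
    SELECT first_name, user_id, last_name
    FROM users
    WHERE first_name IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY ((first_name), user_id);

-- Exact-match lookup then hits a single partition:
--   SELECT * FROM users_by_first_name WHERE first_name = 'J';
```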


was (Author: jkrupan):
Is there also a way to query a SASI-indexed column by exact value? I mean, it 
seems as if by enabling prefix or contains, that it will always query by prefix 
or contains. For example, if I want to query for full first name, like where 
their full first name really is "J" and not get "John" and "James" as well, 
while at other times I am indeed looking for names starting with a prefix of 
"Jo" for "John", "Joseph", etc.

Or, can I indeed have two indexes on a single column, one a traditional exact 
match, and one a prefix match. Hmmm... in which case, which gets used if I just 
specify a column name?

CREATE INDEX first_name_full ON mytable (first_name)...
CREATE CUSTOM INDEX first_name_prefix ON mytable (first_name)...

It would be good to have an example that illustrates this. In fact, I would 
argue that first and last names are perfect examples of where you really do 
need to query on both exact match and partial match. In fact, I'm not sure I 
can think of any examples of non-tokenized text fields where you don't want to 
reserve the ability to find an exact match even if you do need partial matches 
for some queries.

Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

Are there any use cases of traditional Cassandra indexes which shouldn't almost 
automatically be converted to SPARSE. After all, the current recommended best 
practice is to avoid secondary indexes where the column cardinality is either 
very high or very low, which seems to be a match for SPARSE, although the 
precise meaning of SPARSE is still a bit fuzzy for me.


[jira] [Updated] (CASSANDRA-11056) Use max timestamp to decide DTCS-timewindow-membership

2016-01-23 Thread Wei Deng (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Deng updated CASSANDRA-11056:
-
Labels: dtcs  (was: )

> Use max timestamp to decide DTCS-timewindow-membership
> --
>
> Key: CASSANDRA-11056
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11056
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Marcus Eriksson
>Assignee: Björn Hegerfors
>  Labels: dtcs
> Attachments: cassandra-2.2-CASSANDRA-11056.txt
>
>
> TWCS (CASSANDRA-9666) uses max timestamp to decide time window membership, we 
> should do the same in DTCS so that users can configure DTCS to work exactly 
> like TWCS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113842#comment-15113842
 ] 

Jon Haddad commented on CASSANDRA-10661:


If sparse means what Jack is implying, perhaps a better name for it would be 
EXACT



[jira] [Comment Edited] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113836#comment-15113836
 ] 

Jack Krupansky edited comment on CASSANDRA-10661 at 1/23/16 4:55 PM:
-

Is there also a way to query a SASI-indexed column by exact value? I mean, it 
seems as if by enabling prefix or contains, that it will always query by prefix 
or contains. For example, if I want to query for full first name, like where 
their full first name really is "J" and not get "John" and "James" as well, 
while at other times I am indeed looking for names starting with a prefix of 
"Jo" for "John", "Joseph", etc.

Or, can I indeed have two indexes on a single column, one a traditional exact 
match, and one a prefix match. Hmmm... in which case, which gets used if I just 
specify a column name?

CREATE INDEX first_name_full ON mytable (first_name)...
CREATE CUSTOM INDEX first_name_prefix ON mytable (first_name)...

It would be good to have an example that illustrates this. In fact, I would 
argue that first and last names are perfect examples of where you really do 
need to query on both exact match and partial match. In fact, I'm not sure I 
can think of any examples of non-tokenized text fields where you don't want to 
reserve the ability to find an exact match even if you do need partial matches 
for some queries.

Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

Are there any use cases of traditional Cassandra indexes which shouldn't almost 
automatically be converted to SPARSE. After all, the current recommended best 
practice is to avoid secondary indexes where the column cardinality is either 
very high or very low, which seems to be a match for SPARSE, although the 
precise meaning of SPARSE is still a bit fuzzy for me.


was (Author: jkrupan):
Is there also a way to query a SASI-indexed column by exact value? I mean, it 
seems as if by enabling prefix or contains, that it will always query by prefix 
or contains. For example, if I want to query for full first name, like where 
their full first name really is "J" and not get "John" and "James" as well, 
while at other times I am indeed looking for names starting with a prefix of 
"Jo" for "John", "Joseph", etc.

Or, can I indeed have two indexes on a single column, one a traditional exact 
match, and one a prefix match. Hmmm... in which case, which gets used if I just 
specify a column name?

CREATE INDEX first_name_full ON table 
CREATE CUSTOM INDEX first_name_prefix ...

It would be good to have an example that illustrates this. In fact, I would 
argue that first and last names are perfect examples of where you really do 
need to query on both exact match and partial match. In fact, I'm not sure I 
can think of any examples of non-tokenized text fields where you don't want to 
reserve the ability to find an exact match even if you do need partial matches 
for some queries.

Will SPARSE mode in fact give me an exact match? (Sounds like it.) In which 
case, would I be better off with a SPARSE index for first_name_full, or would a 
traditional Cassandra non-custom index work fine (or even better.)

Are there any use cases of traditional Cassandra indexes which shouldn't almost 
automatically be converted to SPARSE. After all, the current recommended best 
practice is to avoid secondary indexes where the column cardinality is either 
very high or very low, which seems to be a match for SPARSE, although the 
precise meaning of SPARSE is still a bit fuzzy for me.



[jira] [Comment Edited] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113842#comment-15113842
 ] 

Jon Haddad edited comment on CASSANDRA-10661 at 1/23/16 5:02 PM:
-

If sparse means what Jack is implying, perhaps a better name for it would be 
EXACT.

Using SPARSE will usually result in people asking "what does that mean", and 
the answer will be "exact match" so I propose we just use that as it'll cut 
down on the number of questions people have.


was (Author: rustyrazorblade):
If sparse means what Jack is implying, perhaps a better name for it would be 
EXACT



[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113967#comment-15113967
 ] 

Pavel Yaskevich commented on CASSANDRA-10661:
-

bq. So is this stuff actually ready to release? I mean, consistent with the new 
philosophy that "trunk is always releasable"? IOW, if it does get committed, it 
will be in 3.4 no matter what? I only ask because it seemed that there was 
stuff in flux fairly recently (a couple of days ago), suggesting it wasn't 
quite baked enough to be considered "releasable".

Yes, it is ready to release: the recently added changes are ported from 2.0, 
and clustering support is just a couple of lines of additional filtering with 
no internal data structure changes. This is also an opt-in feature which is 
irrelevant to core functionality until enabled, which is also the reason why we 
don't want to do any of the CQL front-end related changes right away, but 
rather take a more gradual migration.



[jira] [Commented] (CASSANDRA-10937) OOM on multiple nodes on write load (v. 3.0.0), problem also present on DSE-4.8.3, but there it survives more time

2016-01-23 Thread Peter Kovgan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114212#comment-15114212
 ] 

Peter Kovgan commented on CASSANDRA-10937:
--

Thank you, Jack.

I answer inline:

I still don't see any reason to believe that there is a bug here; the primary 
issue is that you are overloading the cluster.

Peter: Agree and hope this is the reason

Sure, Cassandra should do a better job of shedding/failing excessive incoming 
requests, and there is an open Jira ticket to add just such a feature, but even 
with that new feature, the net effect will be the same - it will still be up to 
the application and operations to properly size the cluster and throttle 
application load before it gets to Cassandra.

Peter: No problem, I understand the driving force for that. I only claim that 
a friendly warning would be appropriate when there is an estimated danger of 
approaching OOM. It is hard to do that, I understand. Some situations are not 
easy to analyze and draw conclusions from. But see below…

OOM is not typically an indication of a software bug. Sure, sometimes code has 
memory leaks, but with a highly dynamic system such as Cassandra, it typically 
means either a misconfigured JVM or just very heavy load. Sometimes OOM simply 
means that there is a lot of background processing going on (like compactions 
or hinted handoff) that is having trouble keeping up with incoming requests. 
Sometimes OOM occurs because you have too large a heap which defers GC but then 
GC takes too long and further incoming requests simply generate more pressure 
on the heap faster than that massive GC can deal with it.

Peter: Regarding compactions, I could imagine that. We notice progressive 
growth in IO demand, so I would take progressive growth in IO wait as a 
warning trigger for a possibly approaching OOM. E.g. if normal IO wait is 
configured as 0.3%, and the system progressively goes through some configured 
thresholds of 0.7, 1.0, 1.5%, I would like to see that noted in a warning log. 
That way I could judge earlier whether I need to grow the ring or expect an 
OOM.
Now, in the latest test, I see pending compactions gradually increasing, very 
slowly. Two days ago it was 40, now 135. I wonder, is that a sign of a pending 
problem?


It is indeed tricky to make sure the JVM has enough heap but not too much. 

Peter: Aware of that. I deal with GC issues in general more frequently than 
others in my company. Previous DSE tests were done with G1, providing a 
multiple of 2048MB (a G1 recommendation); concretely I gave it 73728M. Here I 
assume effective GC with G1 is more a function of available CPU, because there 
are a lot of "young" and "old" spaces and things are more complicated than in 
the Concurrent collector. CPU was fine when the OOM happened, with a lot of 
idle time, another sign that IO is the bottleneck.
We are now testing 2 single-node installations, one with a 36GB heap and one 
with 73GB. I want to see which one does better. We also reduced the load to 
5 MB/sec, instead of 25-30. 

DSE typically runs with a larger heap by default. You can try increasing your 
heap to 10 or 12GB. But if you make the heap too big, the big GC can bite you 
as described above; in that case, the heap needs to be reduced. Typically you 
don't need a heap smaller than 8GB. If OOM occurs with an 8GB heap it 
typically means the load on that node is simply too heavy.
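As a hedged illustration, heap bounds of this sort are typically pinned in cassandra-env.sh rather than left to auto-sizing (the values below are placeholders for discussion, not recommendations for this workload):

```sh
# cassandra-env.sh -- set the heap explicitly instead of relying on
# the script's auto-calculation from system memory.
MAX_HEAP_SIZE="8G"    # total JVM heap; too large defers GC until it is massive
HEAP_NEWSIZE="800M"   # CMS young generation; a common rule is ~100MB per core
```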
Be sure to review the recommendations in this blog post:
http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra

Peter: Done. All is by the book, except:
We use a custom producer and a custom data model.
We changed the data model, trying to make it more effective; the last change 
was adding the day to the partition key, as we want to avoid too-wide rows. 
Our producer is multi-threaded and configurable. 

A few questions that will help us better understand what you are really trying 
to do:
1. How much reading are you doing, and when, relative to writes?
Peter: In the OOM-ended tests (in all tests before) we did only writes. Just 
recently, with the lower load, I started doing reads.
So far it is OK. (4 days have passed)

2. Are you doing any updates or deletes? (These cause compaction, which can 
fall behind your write/update load.)
Peter: No, no updates, and we will not do any. Our TTL will be set to 4 weeks 
in production. For now I use no TTL, to test reads on a larger data store.

3. How much data is on the cluster (rows)?

Peter:
This info is currently unavailable (for the OOM-ended tests and the previous 
data model). I cannot check, because Cassandra fails on OOM during restart and 
I have no different environment to look at.
But for today’s test (we added the day to the partition key; other parameters 
are the same) the estimated numbers from nodetool cfstats are:
Number of keys (estimate): 2000
Number of keys (estimate): 10142095
Number of keys (estimate): 350
Number of keys (estimate): 2000
Number of keys (estimate): 350
Number of keys (estimate): 12491
I assume now 

[04/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/resources/tokenization/apache_license_header.txt
--
diff --git a/test/resources/tokenization/apache_license_header.txt 
b/test/resources/tokenization/apache_license_header.txt
new file mode 100644
index 000..d973dce
--- /dev/null
+++ b/test/resources/tokenization/apache_license_header.txt
@@ -0,0 +1,16 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/resources/tokenization/ja_jp_1.txt
--
diff --git a/test/resources/tokenization/ja_jp_1.txt 
b/test/resources/tokenization/ja_jp_1.txt
new file mode 100644
index 000..1a0a198
--- /dev/null
+++ b/test/resources/tokenization/ja_jp_1.txt
@@ -0,0 +1 @@
+古写本は題名の記されていないものも多く、記されている場合であっても内容はさまざまである。『源氏物語』の場合は冊子の標題として「源氏物語」ないしそれに相当する物語全体の標題が記されている場合よりも、それぞれの帖名が記されていることが少なくない。こうした経緯から、現在において一般に『源氏物語』と呼ばれているこの物語が書かれた当時の題名が何であったのかは明らかではない。古い時代の写本や注釈書などの文献に記されている名称は大きく以下の系統に分かれる。
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/resources/tokenization/ja_jp_2.txt
--
diff --git a/test/resources/tokenization/ja_jp_2.txt 
b/test/resources/tokenization/ja_jp_2.txt
new file mode 100644
index 000..278b4fd
--- /dev/null
+++ b/test/resources/tokenization/ja_jp_2.txt
@@ -0,0 +1,2 @@
+中野幸一編『常用 
源氏物語要覧』武蔵野書院、1997年(平成9年)。 ISBN 
4-8386-0383-5
+その他にCD-ROM化された本文検索システム
として次のようなものがある。
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/resources/tokenization/lorem_ipsum.txt
--
diff --git a/test/resources/tokenization/lorem_ipsum.txt 
b/test/resources/tokenization/lorem_ipsum.txt
new file mode 100644
index 000..14a4477
--- /dev/null
+++ b/test/resources/tokenization/lorem_ipsum.txt
@@ -0,0 +1 @@
+"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod 
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo 
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse 
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non 
proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/resources/tokenization/ru_ru_1.txt
--
diff --git a/test/resources/tokenization/ru_ru_1.txt 
b/test/resources/tokenization/ru_ru_1.txt
new file mode 100644
index 000..c19a9be
--- /dev/null
+++ b/test/resources/tokenization/ru_ru_1.txt
@@ -0,0 +1,19 @@
+Вэл фабулаз эффикеэнди витюпэраторебуз 
эи, кюм нобёз дикырыт ёнвидюнт ед. Ючю золэт 
ийжквюы эа, нык но элитр волуптюа 
пэркёпитюр. Ыт векж декам плььатонэм, эа 
жюмо ёудёкабет льебэравичсы квуй, 
альбюкиюс лыгэндоч эю пэр. Еюж ед аутым 
нюмквуам тебиквюэ, эи амэт дэбыт нюлльам 
квюо. Ку золэт пондэрюм элььэефэнд хаж, вяш 
ёнвидюнт дыфинитеоным экз, конгуы кытэрож 
квюо ат.
+
+Ад фиэрэнт 

[14/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
Integrate SASI index into Cassandra

patch by xedin; reviewed by beobal for CASSANDRA-10661


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/72790dc8
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/72790dc8
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/72790dc8

Branch: refs/heads/trunk
Commit: 72790dc8e34826b39ac696b03025ae6b7b6beb2b
Parents: 11c8ca6
Author: Pavel Yaskevich 
Authored: Wed Dec 2 19:23:54 2015 -0800
Committer: Pavel Yaskevich 
Committed: Sat Jan 23 19:35:29 2016 -0800

--
 CHANGES.txt | 1 +
 build.xml   |22 +-
 doc/SASI.md |   768 +
 lib/concurrent-trees-2.4.0.jar  |   Bin 0 -> 118696 bytes
 lib/hppc-0.5.4.jar  |   Bin 0 -> 1305173 bytes
 lib/jflex-1.6.0.jar |   Bin 0 -> 1048690 bytes
 lib/licenses/concurrent-trees-2.4.0.txt |   201 +
 lib/licenses/hppc-0.5.4.txt |   202 +
 lib/licenses/jflex-1.6.0.txt|   201 +
 lib/licenses/primitive-1.0.txt  |   201 +
 lib/licenses/snowball-stemmer-1.3.0.581.1.txt   |   201 +
 lib/primitive-1.0.jar   |   Bin 0 -> 52589 bytes
 lib/snowball-stemmer-1.3.0.581.1.jar|   Bin 0 -> 93019 bytes
 .../cassandra/config/DatabaseDescriptor.java| 7 +-
 .../org/apache/cassandra/db/ColumnIndex.java| 6 +-
 .../apache/cassandra/db/filter/RowFilter.java   |15 +-
 .../cassandra/index/SecondaryIndexManager.java  |11 +
 .../apache/cassandra/index/sasi/SASIIndex.java  |   288 +
 .../cassandra/index/sasi/SASIIndexBuilder.java  |   128 +
 .../cassandra/index/sasi/SSTableIndex.java  |   187 +
 .../org/apache/cassandra/index/sasi/Term.java   |65 +
 .../cassandra/index/sasi/TermIterator.java  |   208 +
 .../index/sasi/analyzer/AbstractAnalyzer.java   |51 +
 .../index/sasi/analyzer/NoOpAnalyzer.java   |54 +
 .../sasi/analyzer/NonTokenizingAnalyzer.java|   126 +
 .../sasi/analyzer/NonTokenizingOptions.java |   147 +
 .../sasi/analyzer/SUPPLEMENTARY.jflex-macro |   143 +
 .../index/sasi/analyzer/StandardAnalyzer.java   |   194 +
 .../sasi/analyzer/StandardTokenizerImpl.jflex   |   220 +
 .../analyzer/StandardTokenizerInterface.java|65 +
 .../sasi/analyzer/StandardTokenizerOptions.java |   272 +
 .../analyzer/filter/BasicResultFilters.java |76 +
 .../analyzer/filter/FilterPipelineBuilder.java  |51 +
 .../analyzer/filter/FilterPipelineExecutor.java |53 +
 .../analyzer/filter/FilterPipelineTask.java |52 +
 .../sasi/analyzer/filter/StemmerFactory.java|   101 +
 .../sasi/analyzer/filter/StemmingFilters.java   |46 +
 .../sasi/analyzer/filter/StopWordFactory.java   |   100 +
 .../sasi/analyzer/filter/StopWordFilters.java   |42 +
 .../cassandra/index/sasi/conf/ColumnIndex.java  |   193 +
 .../cassandra/index/sasi/conf/DataTracker.java  |   162 +
 .../cassandra/index/sasi/conf/IndexMode.java|   169 +
 .../index/sasi/conf/view/PrefixTermTree.java|   194 +
 .../index/sasi/conf/view/RangeTermTree.java |77 +
 .../index/sasi/conf/view/TermTree.java  |58 +
 .../cassandra/index/sasi/conf/view/View.java|   104 +
 .../cassandra/index/sasi/disk/Descriptor.java   |51 +
 .../cassandra/index/sasi/disk/OnDiskBlock.java  |   142 +
 .../cassandra/index/sasi/disk/OnDiskIndex.java  |   773 ++
 .../index/sasi/disk/OnDiskIndexBuilder.java |   627 +
 .../index/sasi/disk/PerSSTableIndexWriter.java  |   361 +
 .../apache/cassandra/index/sasi/disk/Token.java |42 +
 .../cassandra/index/sasi/disk/TokenTree.java|   519 +
 .../index/sasi/disk/TokenTreeBuilder.java   |   839 ++
 .../exceptions/TimeQuotaExceededException.java  |21 +
 .../index/sasi/memory/IndexMemtable.java|71 +
 .../index/sasi/memory/KeyRangeIterator.java |   118 +
 .../cassandra/index/sasi/memory/MemIndex.java   |51 +
 .../index/sasi/memory/SkipListMemIndex.java |97 +
 .../index/sasi/memory/TrieMemIndex.java |   254 +
 .../cassandra/index/sasi/plan/Expression.java   |   340 +
 .../cassandra/index/sasi/plan/Operation.java|   477 +
 .../index/sasi/plan/QueryController.java|   261 +
 .../cassandra/index/sasi/plan/QueryPlan.java|   170 +
 .../cassandra/index/sasi/sa/ByteTerm.java   |51 +
 .../cassandra/index/sasi/sa/CharTerm.java   |54 +
 .../cassandra/index/sasi/sa/IntegralSA.java |84 +
 .../org/apache/cassandra/index/sasi/sa/SA.java  |58 +
 .../cassandra/index/sasi/sa/SuffixSA.java   |   143 +
 .../apache/cassandra/index/sasi/sa/Term.java|58 +
 .../cassandra/index/sasi/sa/TermIterator.java   |31 +
 

[13/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/lib/licenses/jflex-1.6.0.txt
--
diff --git a/lib/licenses/jflex-1.6.0.txt b/lib/licenses/jflex-1.6.0.txt
new file mode 100644
index 000..50086f8
--- /dev/null
+++ b/lib/licenses/jflex-1.6.0.txt
@@ -0,0 +1,201 @@
+ Apache License
+   Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+  "License" shall mean the terms and conditions for use, reproduction,
+  and distribution as defined by Sections 1 through 9 of this document.
+
+  "Licensor" shall mean the copyright owner or entity authorized by
+  the copyright owner that is granting the License.
+
+  "Legal Entity" shall mean the union of the acting entity and all
+  other entities that control, are controlled by, or are under common
+  control with that entity. For the purposes of this definition,
+  "control" means (i) the power, direct or indirect, to cause the
+  direction or management of such entity, whether by contract or
+  otherwise, or (ii) ownership of fifty percent (50%) or more of the
+  outstanding shares, or (iii) beneficial ownership of such entity.
+
+  "You" (or "Your") shall mean an individual or Legal Entity
+  exercising permissions granted by this License.
+
+  "Source" form shall mean the preferred form for making modifications,
+  including but not limited to software source code, documentation
+  source, and configuration files.
+
+  "Object" form shall mean any form resulting from mechanical
+  transformation or translation of a Source form, including but
+  not limited to compiled object code, generated documentation,
+  and conversions to other media types.
+
+  "Work" shall mean the work of authorship, whether in Source or
+  Object form, made available under the License, as indicated by a
+  copyright notice that is included in or attached to the work
+  (an example is provided in the Appendix below).
+
+  "Derivative Works" shall mean any work, whether in Source or Object
+  form, that is based on (or derived from) the Work and for which the
+  editorial revisions, annotations, elaborations, or other modifications
+  represent, as a whole, an original work of authorship. For the purposes
+  of this License, Derivative Works shall not include works that remain
+  separable from, or merely link (or bind by name) to the interfaces of,
+  the Work and Derivative Works thereof.
+
+  "Contribution" shall mean any work of authorship, including
+  the original version of the Work and any modifications or additions
+  to that Work or Derivative Works thereof, that is intentionally
+  submitted to Licensor for inclusion in the Work by the copyright owner
+  or by an individual or Legal Entity authorized to submit on behalf of
+  the copyright owner. For the purposes of this definition, "submitted"
+  means any form of electronic, verbal, or written communication sent
+  to the Licensor or its representatives, including but not limited to
+  communication on electronic mailing lists, source code control systems,
+  and issue tracking systems that are managed by, or on behalf of, the
+  Licensor for the purpose of discussing and improving the Work, but
+  excluding communication that is conspicuously marked or otherwise
+  designated in writing by the copyright owner as "Not a Contribution."
+
+  "Contributor" shall mean Licensor and any individual or Legal Entity
+  on behalf of whom a Contribution has been received by Licensor and
+  subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+  this License, each Contributor hereby grants to You a perpetual,
+  worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+  copyright license to reproduce, prepare Derivative Works of,
+  publicly display, publicly perform, sublicense, and distribute the
+  Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+  this License, each Contributor hereby grants to You a perpetual,
+  worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+  (except as stated in this section) patent license to make, have made,
+  use, offer to sell, sell, import, and otherwise transfer the Work,
+  where such license applies only to those patent claims licensable
+  by such Contributor that are necessarily infringed by their
+  Contribution(s) alone or by combination of their Contribution(s)
+  with the Work to which such Contribution(s) was submitted. If You
+  

[05/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/resources/tokenization/adventures_of_huckleberry_finn_mark_twain.txt
--
diff --git a/test/resources/tokenization/adventures_of_huckleberry_finn_mark_twain.txt b/test/resources/tokenization/adventures_of_huckleberry_finn_mark_twain.txt
new file mode 100644
index 000..27cadc3
--- /dev/null
+++ b/test/resources/tokenization/adventures_of_huckleberry_finn_mark_twain.txt
@@ -0,0 +1,12361 @@
+
+
+The Project Gutenberg EBook of Adventures of Huckleberry Finn, Complete
+by Mark Twain (Samuel Clemens)
+
+This eBook is for the use of anyone anywhere at no cost and with almost
+no restrictions whatsoever. You may copy it, give it away or re-use
+it under the terms of the Project Gutenberg License included with this
+eBook or online at www.gutenberg.net
+
+Title: Adventures of Huckleberry Finn, Complete
+
+Author: Mark Twain (Samuel Clemens)
+
+Release Date: August 20, 2006 [EBook #76]
+
+Last Updated: October 20, 2012]
+
+Language: English
+
+
+*** START OF THIS PROJECT GUTENBERG EBOOK HUCKLEBERRY FINN ***
+
+Produced by David Widger
+
+
+
+
+
+ADVENTURES
+
+OF
+
+HUCKLEBERRY FINN
+
+(Tom Sawyer's Comrade)
+
+By Mark Twain
+
+Complete
+
+
+
+
+CONTENTS.
+
+CHAPTER I. Civilizing Huck.—Miss Watson.—Tom Sawyer Waits.
+
+CHAPTER II. The Boys Escape Jim.—Torn Sawyer's Gang.—Deep-laid Plans.
+
+CHAPTER III. A Good Going-over.—Grace Triumphant.—"One of Tom Sawyers's
+Lies".
+
+CHAPTER IV. Huck and the Judge.—Superstition.
+
+CHAPTER V. Huck's Father.—The Fond Parent.—Reform.
+
+CHAPTER VI. He Went for Judge Thatcher.—Huck Decided to Leave.—Political
+Economy.—Thrashing Around.
+
+CHAPTER VII. Laying for Him.—Locked in the Cabin.—Sinking the
+Body.—Resting.
+
+CHAPTER VIII. Sleeping in the Woods.—Raising the Dead.—Exploring the
+Island.—Finding Jim.—Jim's Escape.—Signs.—Balum.
+
+CHAPTER IX. The Cave.—The Floating House.
+
+CHAPTER X. The Find.—Old Hank Bunker.—In Disguise.
+
+CHAPTER XI. Huck and the Woman.—The Search.—Prevarication.—Going to
+Goshen.
+
+CHAPTER XII. Slow Navigation.—Borrowing Things.—Boarding the Wreck.—The
+Plotters.—Hunting for the Boat.
+
+CHAPTER XIII. Escaping from the Wreck.—The Watchman.—Sinking.
+
+CHAPTER XIV. A General Good Time.—The Harem.—French.
+
+CHAPTER XV. Huck Loses the Raft.—In the Fog.—Huck Finds the Raft.—Trash.
+
+CHAPTER XVI. Expectation.—A White Lie.—Floating Currency.—Running by
+Cairo.—Swimming Ashore.
+
+CHAPTER XVII. An Evening Call.—The Farm in Arkansaw.—Interior
+Decorations.—Stephen Dowling Bots.—Poetical Effusions.
+
+CHAPTER XVIII. Col. Grangerford.—Aristocracy.—Feuds.—The
+Testament.—Recovering the Raft.—The Wood—pile.—Pork and Cabbage.
+
+CHAPTER XIX. Tying Up Day—times.—An Astronomical Theory.—Running a
+Temperance Revival.—The Duke of Bridgewater.—The Troubles of Royalty.
+
+CHAPTER XX. Huck Explains.—Laying Out a Campaign.—Working the
+Camp—meeting.—A Pirate at the Camp—meeting.—The Duke as a Printer.
+
+CHAPTER XXI. Sword Exercise.—Hamlet's Soliloquy.—They Loafed Around
+Town.—A Lazy Town.—Old Boggs.—Dead.
+
+CHAPTER XXII. Sherburn.—Attending the Circus.—Intoxication in the
+Ring.—The Thrilling Tragedy.
+
+CHAPTER XXIII. Sold.—Royal Comparisons.—Jim Gets Home-sick.
+
+CHAPTER XXIV. Jim in Royal Robes.—They Take a Passenger.—Getting
+Information.—Family Grief.
+
+CHAPTER XXV. Is It Them?—Singing the "Doxologer."—Awful Square—Funeral
+Orgies.—A Bad Investment .
+
+CHAPTER XXVI. A Pious King.—The King's Clergy.—She Asked His
+Pardon.—Hiding in the Room.—Huck Takes the Money.
+
+CHAPTER XXVII. The Funeral.—Satisfying Curiosity.—Suspicious of
+Huck,—Quick Sales and Small.
+
+CHAPTER XXVIII. The Trip to England.—"The Brute!"—Mary Jane Decides to
+Leave.—Huck Parting with Mary Jane.—Mumps.—The Opposition Line.
+
+CHAPTER XXIX. Contested Relationship.—The King Explains the Loss.—A
+Question of Handwriting.—Digging up the Corpse.—Huck Escapes.
+
+CHAPTER XXX. The King Went for Him.—A Royal Row.—Powerful Mellow.
+
+CHAPTER XXXI. Ominous Plans.—News from Jim.—Old Recollections.—A Sheep
+Story.—Valuable Information.
+
+CHAPTER XXXII. Still and Sunday—like.—Mistaken Identity.—Up a Stump.—In
+a Dilemma.
+
+CHAPTER XXXIII. A Nigger Stealer.—Southern Hospitality.—A Pretty Long
+Blessing.—Tar and Feathers.
+
+CHAPTER XXXIV. The Hut by the Ash Hopper.—Outrageous.—Climbing the
+Lightning Rod.—Troubled with Witches.
+
+CHAPTER XXXV. Escaping Properly.—Dark Schemes.—Discrimination in
+Stealing.—A Deep Hole.
+
+CHAPTER XXXVI. The Lightning Rod.—His Level Best.—A Bequest to
+Posterity.—A High Figure.
+
+CHAPTER XXXVII. The Last Shirt.—Mooning Around.—Sailing Orders.—The
+Witch Pie.
+
+CHAPTER XXXVIII. The Coat of Arms.—A Skilled Superintendent.—Unpleasant
+Glory.—A Tearful Subject.
+
+CHAPTER XXXIX. Rats.—Lively 

[11/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/src/java/org/apache/cassandra/index/sasi/conf/view/PrefixTermTree.java
--
diff --git a/src/java/org/apache/cassandra/index/sasi/conf/view/PrefixTermTree.java b/src/java/org/apache/cassandra/index/sasi/conf/view/PrefixTermTree.java
new file mode 100644
index 000..72b6daf
--- /dev/null
+++ b/src/java/org/apache/cassandra/index/sasi/conf/view/PrefixTermTree.java
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.conf.view;
+
+import java.nio.ByteBuffer;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.cassandra.index.sasi.SSTableIndex;
+import org.apache.cassandra.index.sasi.disk.OnDiskIndexBuilder;
+import org.apache.cassandra.index.sasi.plan.Expression;
+import org.apache.cassandra.index.sasi.utils.trie.KeyAnalyzer;
+import org.apache.cassandra.index.sasi.utils.trie.PatriciaTrie;
+import org.apache.cassandra.index.sasi.utils.trie.Trie;
+import org.apache.cassandra.db.marshal.AbstractType;
+import org.apache.cassandra.utils.Interval;
+import org.apache.cassandra.utils.IntervalTree;
+
+import com.google.common.collect.Sets;
+
+/**
+ * This class is an extension over RangeTermTree for string terms,
+ * it is required because interval tree can't handle matching if search is on the
+ * prefix of min/max of the range, so for ascii/utf8 fields we build an additional
+ * prefix trie (including both min/max terms of the index) and do union of the results
+ * of the prefix tree search and results from the interval tree lookup.
+ */
+public class PrefixTermTree extends RangeTermTree
+{
+    private final OnDiskIndexBuilder.Mode mode;
+    private final Trie<ByteBuffer, Set<SSTableIndex>> trie;
+
+    public PrefixTermTree(ByteBuffer min, ByteBuffer max,
+                          Trie<ByteBuffer, Set<SSTableIndex>> trie,
+                          IntervalTree<ByteBuffer, SSTableIndex, Interval<ByteBuffer, SSTableIndex>> ranges,
+                          OnDiskIndexBuilder.Mode mode)
+    {
+        super(min, max, ranges);
+
+        this.mode = mode;
+        this.trie = trie;
+    }
+
+    public Set<SSTableIndex> search(Expression e)
+    {
+        Map<ByteBuffer, Set<SSTableIndex>> indexes = (e == null || e.lower == null || mode == OnDiskIndexBuilder.Mode.CONTAINS)
+                                                        ? trie : trie.prefixMap(e.lower.value);
+
+        Set<SSTableIndex> view = new HashSet<>(indexes.size());
+        indexes.values().forEach(view::addAll);
+
+        return Sets.union(view, super.search(e));
+    }
+
+    public static class Builder extends RangeTermTree.Builder
+    {
+        private final PatriciaTrie<ByteBuffer, Set<SSTableIndex>> trie;
+
+        protected Builder(OnDiskIndexBuilder.Mode mode, final AbstractType<?> comparator)
+        {
+            super(mode, comparator);
+            trie = new PatriciaTrie<>(new ByteBufferKeyAnalyzer(comparator));
+        }
+
+        public void addIndex(SSTableIndex index)
+        {
+            super.addIndex(index);
+            addTerm(index.minTerm(), index);
+            addTerm(index.maxTerm(), index);
+        }
+
+        public TermTree build()
+        {
+            return new PrefixTermTree(min, max, trie, IntervalTree.build(intervals), mode);
+        }
+
+        private void addTerm(ByteBuffer term, SSTableIndex index)
+        {
+            Set<SSTableIndex> indexes = trie.get(term);
+            if (indexes == null)
+                trie.put(term, (indexes = new HashSet<>()));
+
+            indexes.add(index);
+        }
+    }
+
+    private static class ByteBufferKeyAnalyzer implements KeyAnalyzer<ByteBuffer>
+    {
+        private final AbstractType<?> comparator;
+
+        public ByteBufferKeyAnalyzer(AbstractType<?> comparator)
+        {
+            this.comparator = comparator;
+        }
+
+        /**
+         * A bit mask where the first bit is 1 and the others are zero
+         */
+        private static final int MSB = 1 << Byte.SIZE - 1;
+
+        public int compare(ByteBuffer a, ByteBuffer b)
+        {
+            return comparator.compare(a, b);
+        }
+
+        public int lengthInBits(ByteBuffer o)
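The class above unions two lookups: a prefix-trie scan (to catch queries that are a prefix of a range bound) and an interval-tree range search. As a minimal, self-contained sketch of the prefix-scan-plus-union idea, the snippet below substitutes a `TreeMap` for the trie; every name here (`PrefixUnionIndex`, `searchPrefix`, the integer sstable ids) is illustrative and not part of the Cassandra API.

```java
import java.util.*;

// Illustrative sketch only: terms map to the set of sstable ids that contain
// them, and a prefix query walks the contiguous key range sharing the prefix.
public class PrefixUnionIndex {
    private final TreeMap<String, Set<Integer>> terms = new TreeMap<>();

    public void add(String term, int sstableId) {
        terms.computeIfAbsent(term, k -> new HashSet<>()).add(sstableId);
    }

    // All sstable ids whose indexed term starts with the given prefix.
    public Set<Integer> searchPrefix(String prefix) {
        Set<Integer> result = new HashSet<>();
        // tailMap starts at the first key >= prefix; keys sharing the prefix
        // are contiguous in sorted order, so we stop at the first mismatch.
        for (Map.Entry<String, Set<Integer>> e : terms.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix))
                break;
            result.addAll(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixUnionIndex idx = new PrefixUnionIndex();
        idx.add("apple", 1);
        idx.add("application", 2);
        idx.add("banana", 3);
        Set<Integer> hits = idx.searchPrefix("app");
        if (!hits.equals(new HashSet<>(Arrays.asList(1, 2))))
            throw new AssertionError("expected {1, 2}, got " + hits);
        System.out.println("prefix search hits: " + hits);
    }
}
```

In the real `PrefixTermTree.search`, the result of the equivalent `trie.prefixMap` scan is then unioned with `super.search(e)`, the interval-tree result.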

[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-01-23 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114168#comment-15114168
 ] 

Pavel Yaskevich commented on CASSANDRA-10661:
-

Pushed as squashed commit 
[72790dc|https://github.com/apache/cassandra/commit/72790dc8e34826b39ac696b03025ae6b7b6beb2b].
 I'm going to resolve this issue and promote CASSANDRA-10765 from sub-task.

> Integrate SASI to Cassandra
> ---
>
> Key: CASSANDRA-10661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Pavel Yaskevich
>Assignee: Pavel Yaskevich
>  Labels: sasi
> Fix For: 3.4
>
>
> We have recently released new secondary index engine 
> (https://github.com/xedin/sasi) build using SecondaryIndex API, there are 
> still couple of things to work out regarding 3.x since it's currently 
> targeted on 2.0 released. I want to make this an umbrella issue to all of the 
> things related to integration of SASI, which are also tracked in 
> [sasi_issues|https://github.com/xedin/sasi/issues], into mainline Cassandra 
> 3.x release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[06/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/src/resources/org/apache/cassandra/index/sasi/analyzer/filter/fi_ST.txt
--
diff --git a/src/resources/org/apache/cassandra/index/sasi/analyzer/filter/fi_ST.txt b/src/resources/org/apache/cassandra/index/sasi/analyzer/filter/fi_ST.txt
new file mode 100644
index 000..3c8bfd5
--- /dev/null
+++ b/src/resources/org/apache/cassandra/index/sasi/analyzer/filter/fi_ST.txt
@@ -0,0 +1,748 @@
+# Stop Words List from http://members.unine.ch/jacques.savoy/clef/index.html
+aiemmin
+aika
+aikaa
+aikaan
+aikaisemmin
+aikaisin
+aikajen
+aikana
+aikoina
+aikoo
+aikovat
+aina
+ainakaan
+ainakin
+ainoa
+ainoat
+aiomme
+aion
+aiotte
+aist
+aivan
+ajan
+älä
+alas
+alemmas
+älkään
+alkuisin
+alkuun
+alla
+alle
+aloitamme
+aloitan
+aloitat
+aloitatte
+aloitattivat
+aloitettava
+aloitettevaksi
+aloitettu
+aloitimme
+aloitin
+aloitit
+aloititte
+aloittaa
+aloittamatta
+aloitti
+aloittivat
+alta
+aluksi
+alussa
+alusta
+annettavaksi
+annetteva
+annettu
+antaa
+antamatta
+antoi
+aoua
+apu
+asia
+asiaa
+asian
+asiasta
+asiat
+asioiden
+asioihin
+asioita
+asti
+avuksi
+avulla
+avun
+avutta
+edellä
+edelle
+edelleen
+edeltä
+edemmäs
+edes
+edessä
+edestä
+ehkä
+ei
+eikä
+eilen
+eivät
+eli
+ellei
+elleivät
+ellemme
+ellen
+ellet
+ellette
+emme
+en
+enää
+enemmän
+eniten
+ennen
+ensi
+ensimmäinen
+ensimmäiseksi
+ensimmäisen
+ensimmäisenä
+ensimmäiset
+ensimmäisiä
+ensimmäisiksi
+ensimmäisinä
+ensimmäistä
+ensin
+entinen
+entisen
+entisiä
+entistä
+entisten
+eräät
+eräiden
+eräs
+eri
+erittäin
+erityisesti
+esi
+esiin
+esillä
+esimerkiksi
+et
+eteen
+etenkin
+että
+ette
+ettei
+halua
+haluaa
+haluamatta
+haluamme
+haluan
+haluat
+haluatte
+haluavat
+halunnut
+halusi
+halusimme
+halusin
+halusit
+halusitte
+halusivat
+halutessa
+haluton
+hän
+häneen
+hänellä
+hänelle
+häneltä
+hänen
+hänessä
+hänestä
+hänet
+he
+hei
+heidän
+heihin
+heille
+heiltä
+heissä
+heistä
+heitä
+helposti
+heti
+hetkellä
+hieman
+huolimatta
+huomenna
+hyvä
+hyvää
+hyvät
+hyviä
+hyvien
+hyviin
+hyviksi
+hyville
+hyviltä
+hyvin
+hyvinä
+hyvissä
+hyvistä
+ihan
+ilman
+ilmeisesti
+itse
+itseään
+itsensä
+ja
+jää
+jälkeen
+jälleen
+jo
+johon
+joiden
+joihin
+joiksi
+joilla
+joille
+joilta
+joissa
+joista
+joita
+joka
+jokainen
+jokin
+joko
+joku
+jolla
+jolle
+jolloin
+jolta
+jompikumpi
+jonka
+jonkin
+jonne
+joo
+jopa
+jos
+joskus
+jossa
+josta
+jota
+jotain
+joten
+jotenkin
+jotenkuten
+jotka
+jotta
+jouduimme
+jouduin
+jouduit
+jouduitte
+joudumme
+joudun
+joudutte
+joukkoon
+joukossa
+joukosta
+joutua
+joutui
+joutuivat
+joutumaan
+joutuu
+joutuvat
+juuri
+kahdeksan
+kahdeksannen
+kahdella
+kahdelle
+kahdelta
+kahden
+kahdessa
+kahdesta
+kahta
+kahteen
+kai
+kaiken
+kaikille
+kaikilta
+kaikkea
+kaikki
+kaikkia
+kaikkiaan
+kaikkialla
+kaikkialle
+kaikkialta
+kaikkien
+kaikkin
+kaksi
+kannalta
+kannattaa
+kanssa
+kanssaan
+kanssamme
+kanssani
+kanssanne
+kanssasi
+kauan
+kauemmas
+kautta
+kehen
+keiden
+keihin
+keiksi
+keillä
+keille
+keiltä
+keinä
+keissä
+keistä
+keitä
+keittä
+keitten
+keneen
+keneksi
+kenellä
+kenelle
+keneltä
+kenen
+kenenä
+kenessä
+kenestä
+kenet
+kenettä
+kennessästä
+kerran
+kerta
+kertaa
+kesken
+keskimäärin
+ketä
+ketkä
+kiitos
+kohti
+koko
+kokonaan
+kolmas
+kolme
+kolmen
+kolmesti
+koska
+koskaan
+kovin
+kuin
+kuinka
+kuitenkaan
+kuitenkin
+kuka
+kukaan
+kukin
+kumpainen
+kumpainenkaan
+kumpi
+kumpikaan
+kumpikin
+kun
+kuten
+kuuden
+kuusi
+kuutta
+kyllä
+kymmenen
+kyse
+lähekkäin
+lähellä
+lähelle
+läheltä
+lähemmäs
+lähes
+lähinnä
+lähtien
+läpi
+liian
+liki
+lisää
+lisäksi
+luo
+mahdollisimman
+mahdollista
+me
+meidän
+meillä
+meille
+melkein
+melko
+menee
+meneet
+menemme
+menen
+menet
+menette
+menevät
+meni
+menimme
+menin
+menit
+menivät
+mennessä
+mennyt
+menossa
+mihin
+mikä
+mikään
+mikäli
+mikin
+miksi
+milloin
+minä
+minne
+minun
+minut
+missä
+mistä
+mitä
+mitään
+miten
+moi
+molemmat
+mones
+monesti
+monet
+moni
+moniaalla
+moniaalle
+moniaalta
+monta
+muassa
+muiden
+muita
+muka
+mukaan
+mukaansa
+mukana
+mutta
+muu
+muualla
+muualle
+muualta
+muuanne
+muulloin
+muun
+muut
+muuta
+muutama
+muutaman
+muuten
+myöhemmin
+myös
+myöskään
+myöskin
+myötä
+näiden
+näin
+näissä
+näissähin
+näissälle
+näissältä
+näissästä
+näitä
+nämä
+ne
+neljä
+neljää
+neljän
+niiden
+niin
+niist�
+niit�
+noin
+nopeammin
+nopeasti
+nopeiten
+nro
+nuo
+nyt
+ohi
+oikein
+ole
+olemme
+olen
+olet
+olette
+oleva
+olevan
+olevat
+oli
+olimme
+olin
+olisi
+olisimme
+olisin
+olisit
+olisitte
+olisivat
+olit
+olitte
+olivat
+olla
+olleet
+olli
+ollut
+oma
+omaa
+omaan
+omaksi
+omalle
+omalta
+oman
+omassa
+omat
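The resource above is a one-word-per-line stop-word list (with `#` comment lines) that an analyzer consults to drop high-frequency words before indexing. A hedged, standalone sketch of that filtering step follows; the class and method names (`StopWordFilterSketch`, `filter`) are illustrative, not SASI's actual `StopWordFilters` API.

```java
import java.util.*;

// Illustrative sketch: tokens present in the stop-word set are dropped,
// everything else is kept in order. Matching is done case-insensitively,
// mirroring how a lowercasing normalization pass would precede the filter.
public class StopWordFilterSketch {
    public static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens)
            if (!stopWords.contains(t.toLowerCase(Locale.ROOT)))
                kept.add(t);
        return kept;
    }

    public static void main(String[] args) {
        // Tiny excerpt of the Finnish list above ("ja" = and, "ei" = no/not).
        Set<String> fi = new HashSet<>(Arrays.asList("ja", "ei", "eli"));
        List<String> out = filter(Arrays.asList("kissa", "ja", "koira"), fi);
        System.out.println(out); // [kissa, koira]
    }
}
```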

[12/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/src/java/org/apache/cassandra/index/sasi/analyzer/NonTokenizingOptions.java
--
diff --git a/src/java/org/apache/cassandra/index/sasi/analyzer/NonTokenizingOptions.java b/src/java/org/apache/cassandra/index/sasi/analyzer/NonTokenizingOptions.java
new file mode 100644
index 000..303087b
--- /dev/null
+++ b/src/java/org/apache/cassandra/index/sasi/analyzer/NonTokenizingOptions.java
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.analyzer;
+
+import java.util.Map;
+
+public class NonTokenizingOptions
+{
+    public static final String NORMALIZE_LOWERCASE = "normalize_lowercase";
+    public static final String NORMALIZE_UPPERCASE = "normalize_uppercase";
+    public static final String CASE_SENSITIVE = "case_sensitive";
+
+    private boolean caseSensitive;
+    private boolean upperCaseOutput;
+    private boolean lowerCaseOutput;
+
+    public boolean isCaseSensitive()
+    {
+        return caseSensitive;
+    }
+
+    public void setCaseSensitive(boolean caseSensitive)
+    {
+        this.caseSensitive = caseSensitive;
+    }
+
+    public boolean shouldUpperCaseOutput()
+    {
+        return upperCaseOutput;
+    }
+
+    public void setUpperCaseOutput(boolean upperCaseOutput)
+    {
+        this.upperCaseOutput = upperCaseOutput;
+    }
+
+    public boolean shouldLowerCaseOutput()
+    {
+        return lowerCaseOutput;
+    }
+
+    public void setLowerCaseOutput(boolean lowerCaseOutput)
+    {
+        this.lowerCaseOutput = lowerCaseOutput;
+    }
+
+    public static class OptionsBuilder
+    {
+        private boolean caseSensitive = true;
+        private boolean upperCaseOutput = false;
+        private boolean lowerCaseOutput = false;
+
+        public OptionsBuilder()
+        {
+        }
+
+        public OptionsBuilder caseSensitive(boolean caseSensitive)
+        {
+            this.caseSensitive = caseSensitive;
+            return this;
+        }
+
+        public OptionsBuilder upperCaseOutput(boolean upperCaseOutput)
+        {
+            this.upperCaseOutput = upperCaseOutput;
+            return this;
+        }
+
+        public OptionsBuilder lowerCaseOutput(boolean lowerCaseOutput)
+        {
+            this.lowerCaseOutput = lowerCaseOutput;
+            return this;
+        }
+
+        public NonTokenizingOptions build()
+        {
+            if (lowerCaseOutput && upperCaseOutput)
+                throw new IllegalArgumentException("Options to normalize terms cannot be " +
+                                                   "both uppercase and lowercase at the same time");
+
+            NonTokenizingOptions options = new NonTokenizingOptions();
+            options.setCaseSensitive(caseSensitive);
+            options.setUpperCaseOutput(upperCaseOutput);
+            options.setLowerCaseOutput(lowerCaseOutput);
+            return options;
+        }
+    }
+
+    public static NonTokenizingOptions buildFromMap(Map<String, String> optionsMap)
+    {
+        OptionsBuilder optionsBuilder = new OptionsBuilder();
+
+        if (optionsMap.containsKey(CASE_SENSITIVE) && (optionsMap.containsKey(NORMALIZE_LOWERCASE)
+            || optionsMap.containsKey(NORMALIZE_UPPERCASE)))
+            throw new IllegalArgumentException("case_sensitive option cannot be specified together " +
+                                               "with either normalize_lowercase or normalize_uppercase");
+
+        for (Map.Entry<String, String> entry : optionsMap.entrySet())
+        {
+            switch (entry.getKey())
+            {
+                case NORMALIZE_LOWERCASE:
+                {
+                    boolean bool = Boolean.parseBoolean(entry.getValue());
+                    optionsBuilder = optionsBuilder.lowerCaseOutput(bool);
+                    break;
+                }
+                case NORMALIZE_UPPERCASE:
+                {
+                    boolean bool = Boolean.parseBoolean(entry.getValue());
+                    optionsBuilder = optionsBuilder.upperCaseOutput(bool);
+                    break;
+                }
+                case 
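The options class above enforces two rules: `case_sensitive` may not be combined with either normalize option, and the two normalize options are mutually exclusive. The standalone sketch below mirrors only that validation logic so it can run without the Cassandra codebase; `OptionsCheck` and its return convention (a `boolean[]` of `{lower, upper}`) are inventions for this example.

```java
import java.util.*;

// Illustrative re-statement of NonTokenizingOptions' validation rules,
// detached from the builder so it is runnable on its own.
public class OptionsCheck {
    public static boolean[] parse(Map<String, String> opts) {
        if (opts.containsKey("case_sensitive")
                && (opts.containsKey("normalize_lowercase") || opts.containsKey("normalize_uppercase")))
            throw new IllegalArgumentException("case_sensitive cannot be combined with normalize options");

        boolean lower = Boolean.parseBoolean(opts.getOrDefault("normalize_lowercase", "false"));
        boolean upper = Boolean.parseBoolean(opts.getOrDefault("normalize_uppercase", "false"));
        if (lower && upper)
            throw new IllegalArgumentException("cannot normalize to both cases");

        return new boolean[] { lower, upper };
    }

    public static void main(String[] args) {
        boolean[] ok = parse(Map.of("normalize_lowercase", "true"));
        System.out.println("lowercase=" + ok[0] + " uppercase=" + ok[1]);
        try {
            parse(Map.of("case_sensitive", "true", "normalize_lowercase", "true"));
            throw new AssertionError("expected rejection");
        } catch (IllegalArgumentException expected) {
            System.out.println("conflicting options rejected");
        }
    }
}
```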

[07/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/src/java/org/apache/cassandra/index/sasi/utils/trie/PatriciaTrie.java
--
diff --git a/src/java/org/apache/cassandra/index/sasi/utils/trie/PatriciaTrie.java b/src/java/org/apache/cassandra/index/sasi/utils/trie/PatriciaTrie.java
new file mode 100644
index 000..3c672ec
--- /dev/null
+++ b/src/java/org/apache/cassandra/index/sasi/utils/trie/PatriciaTrie.java
@@ -0,0 +1,1261 @@
+/*
+ * Copyright 2005-2010 Roger Kapsi, Sam Berlin
+ *
+ *   Licensed under the Apache License, Version 2.0 (the "License");
+ *   you may not use this file except in compliance with the License.
+ *   You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *   Unless required by applicable law or agreed to in writing, software
+ *   distributed under the License is distributed on an "AS IS" BASIS,
+ *   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *   See the License for the specific language governing permissions and
+ *   limitations under the License.
+ */
+
+package org.apache.cassandra.index.sasi.utils.trie;
+
+import java.io.Serializable;
+import java.util.*;
+
+/**
+ * This class is taken from https://github.com/rkapsi/patricia-trie (v0.6), 
and slightly modified
+ * to correspond to Cassandra code style, as the only Patricia Trie 
implementation,
+ * which supports pluggable key comparators (e.g. commons-collections 
PatriciaTrie (which is based
+ * on rkapsi/patricia-trie project) only supports String keys)
+ * but unfortunately is not deployed to the maven central as a downloadable 
artifact.
+ */
+
+/**
+ * PATRICIA {@link Trie}
+ *  
+ * Practical Algorithm to Retrieve Information Coded in Alphanumeric
+ * 
+ * A PATRICIA {@link Trie} is a compressed {@link Trie}. Instead of storing 
+ * all data at the edges of the {@link Trie} (and having empty internal 
nodes), 
+ * PATRICIA stores data in every node. This allows for very efficient 
traversal, 
+ * insert, delete, predecessor, successor, prefix, range, and {@link 
#select(Object)} 
+ * operations. All operations are performed at worst in O(K) time, where K 
+ * is the number of bits in the largest item in the tree. In practice, 
+ * operations actually take O(A(K)) time, where A(K) is the average number of 
+ * bits of all items in the tree.
+ * 
+ * Most importantly, PATRICIA requires very few comparisons to keys while
+ * doing any operation. While performing a lookup, each comparison (at most 
+ * K of them, described above) will perform a single bit comparison against 
+ * the given key, instead of comparing the entire key to another key.
+ * 
+ * The {@link Trie} can return operations in lexicographical order using 
the 
+ * {@link #traverse(Cursor)}, 'prefix', 'submap', or 'iterator' methods. The 
+ * {@link Trie} can also scan for items that are 'bitwise' (using an XOR 
+ * metric) by the 'select' method. Bitwise closeness is determined by the 
+ * {@link KeyAnalyzer} returning true or false for a bit being set or not in 
+ * a given key.
+ * 
+ * Any methods here that take an {@link Object} argument may throw a 
+ * {@link ClassCastException} if the method is expecting an instance of K 
+ * and it isn't K.
+ * 
+ * @see http://en.wikipedia.org/wiki/Radix_tree;>Radix Tree
+ * @see http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA;>PATRICIA
+ * @see http://www.imperialviolet.org/binary/critbit.pdf;>Crit-Bit 
Tree
+ * 
+ * @author Roger Kapsi
+ * @author Sam Berlin
+ */
+public class PatriciaTrie extends AbstractPatriciaTrie implements 
Serializable
+{
+private static final long serialVersionUID = -2246014692353432660L;
+
+public PatriciaTrie(KeyAnalyzer keyAnalyzer)
+{
+super(keyAnalyzer);
+}
+
+public PatriciaTrie(KeyAnalyzer keyAnalyzer, Map m)
+{
+super(keyAnalyzer, m);
+}
+
+@Override
+public Comparator comparator()
+{
+return keyAnalyzer;
+}
+
+@Override
+public SortedMap prefixMap(K prefix)
+{
+return lengthInBits(prefix) == 0 ? this : new PrefixRangeMap(prefix);
+}
+
+@Override
+public K firstKey()
+{
+return firstEntry().getKey();
+}
+
+@Override
+public K lastKey()
+{
+TrieEntry<K, V> entry = lastEntry();
+return entry != null ? entry.getKey() : null;
+}
+
+@Override
+public SortedMap<K, V> headMap(K toKey)
+{
+return new RangeEntryMap(null, toKey);
+}
+
+@Override
+public SortedMap<K, V> subMap(K fromKey, K toKey)
+{
+return new RangeEntryMap(fromKey, toKey);
+}
+
+@Override
+public SortedMap<K, V> tailMap(K fromKey)
+{
+return new RangeEntryMap(fromKey, null);
+} 
+
+/**
+ * Returns an entry strictly higher than the given key,
+ * or null if no such entry exists.
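The prefix contract shown above (prefixMap returning the live sub-map of all entries whose keys start with a given prefix) can be sketched without the trie itself. The following standalone sketch models only that contract on top of a TreeMap (so lookups cost O(K·log N) rather than PATRICIA's O(K)); the class and method names are illustrative and not part of this patch.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixMapSketch
{
    // Returns the sub-map of entries whose keys start with the given prefix.
    // Mirrors the contract of Trie#prefixMap, not its bitwise implementation.
    // Simplification: assumes the prefix's last char is not Character.MAX_VALUE.
    static SortedMap<String, Integer> prefixMap(NavigableMap<String, Integer> map, String prefix)
    {
        if (prefix.isEmpty())
            return map;
        // Smallest string strictly greater than every string with this prefix:
        // bump the last character of the prefix by one.
        String end = prefix.substring(0, prefix.length() - 1)
                   + (char) (prefix.charAt(prefix.length() - 1) + 1);
        return map.subMap(prefix, true, end, false);
    }

    public static void main(String[] args)
    {
        NavigableMap<String, Integer> m = new TreeMap<>();
        m.put("albert", 1);
        m.put("alberts", 2);
        m.put("allen", 3);
        m.put("bob", 4);

        SortedMap<String, Integer> byPrefix = prefixMap(m, "alber");
        if (!byPrefix.keySet().toString().equals("[albert, alberts]"))
            throw new AssertionError(byPrefix.keySet().toString());
        System.out.println(byPrefix.keySet()); // [albert, alberts]
    }
}
```

A real PATRICIA trie gets the same result by descending at most K bit comparisons, which is why SASI uses it for term dictionaries rather than a comparison-based tree.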

[02/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/unit/org/apache/cassandra/index/sasi/analyzer/NonTokenizingAnalyzerTest.java
--
diff --git 
a/test/unit/org/apache/cassandra/index/sasi/analyzer/NonTokenizingAnalyzerTest.java
 
b/test/unit/org/apache/cassandra/index/sasi/analyzer/NonTokenizingAnalyzerTest.java
new file mode 100644
index 000..ba67853
--- /dev/null
+++ 
b/test/unit/org/apache/cassandra/index/sasi/analyzer/NonTokenizingAnalyzerTest.java
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.analyzer;
+
+import java.nio.ByteBuffer;
+
+import org.apache.cassandra.db.marshal.Int32Type;
+import org.apache.cassandra.db.marshal.UTF8Type;
+import org.apache.cassandra.utils.ByteBufferUtil;
+
+import org.junit.Assert;
+import org.junit.Test;
+
+/**
+ * Tests for the non-tokenizing analyzer
+ */
+public class NonTokenizingAnalyzerTest
+{
+@Test
+public void caseInsensitiveAnalyzer() throws Exception
+{
+NonTokenizingAnalyzer analyzer = new NonTokenizingAnalyzer();
+NonTokenizingOptions options = NonTokenizingOptions.getDefaultOptions();
+options.setCaseSensitive(false);
+analyzer.init(options, UTF8Type.instance);
+
+String testString = "Nip it in the bud";
+ByteBuffer toAnalyze = ByteBuffer.wrap(testString.getBytes());
+analyzer.reset(toAnalyze);
+ByteBuffer analyzed = null;
+while (analyzer.hasNext())
+analyzed = analyzer.next();
+Assert.assertTrue(testString.toLowerCase().equals(ByteBufferUtil.string(analyzed)));
+}
+
+@Test
+public void caseSensitiveAnalyzer() throws Exception
+{
+NonTokenizingAnalyzer analyzer = new NonTokenizingAnalyzer();
+NonTokenizingOptions options = NonTokenizingOptions.getDefaultOptions();
+analyzer.init(options, UTF8Type.instance);
+
+String testString = "Nip it in the bud";
+ByteBuffer toAnalyze = ByteBuffer.wrap(testString.getBytes());
+analyzer.reset(toAnalyze);
+ByteBuffer analyzed = null;
+while (analyzer.hasNext())
+analyzed = analyzer.next();
+Assert.assertFalse(testString.toLowerCase().equals(ByteBufferUtil.string(analyzed)));
+}
+
+@Test
+public void ensureIncompatibleInputSkipped() throws Exception
+{
+NonTokenizingAnalyzer analyzer = new NonTokenizingAnalyzer();
+NonTokenizingOptions options = NonTokenizingOptions.getDefaultOptions();
+analyzer.init(options, Int32Type.instance);
+
+ByteBuffer toAnalyze = ByteBufferUtil.bytes(1);
+analyzer.reset(toAnalyze);
+Assert.assertTrue(!analyzer.hasNext());
+}
+}
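The tests above exercise SASI's real analyzer API. As a rough standalone model of what a non-tokenizing, case-insensitive pass does (a sketch under the assumption of UTF-8 input; the class and method names are hypothetical, not SASI's):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class NonTokenizingSketch
{
    // Emits the whole input as a single "token", optionally lower-cased --
    // the essence of a non-tokenizing analyzer with case sensitivity off.
    static ByteBuffer analyze(ByteBuffer input, boolean caseSensitive)
    {
        // duplicate() so the caller's buffer position is untouched
        String s = StandardCharsets.UTF_8.decode(input.duplicate()).toString();
        if (!caseSensitive)
            s = s.toLowerCase(Locale.ROOT);
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args)
    {
        ByteBuffer in = ByteBuffer.wrap("Nip it in the bud".getBytes(StandardCharsets.UTF_8));
        String out = StandardCharsets.UTF_8.decode(analyze(in, false)).toString();
        if (!out.equals("nip it in the bud"))
            throw new AssertionError(out);
        System.out.println(out); // nip it in the bud
    }
}
```

This is why the case-sensitive test above expects the analyzed output to differ from the lower-cased input: with case sensitivity on, the bytes pass through unchanged.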

http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/unit/org/apache/cassandra/index/sasi/analyzer/StandardAnalyzerTest.java
--
diff --git 
a/test/unit/org/apache/cassandra/index/sasi/analyzer/StandardAnalyzerTest.java 
b/test/unit/org/apache/cassandra/index/sasi/analyzer/StandardAnalyzerTest.java
new file mode 100644
index 000..e307512
--- /dev/null
+++ 
b/test/unit/org/apache/cassandra/index/sasi/analyzer/StandardAnalyzerTest.java
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.analyzer;

[09/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/src/java/org/apache/cassandra/index/sasi/plan/Operation.java
--
diff --git a/src/java/org/apache/cassandra/index/sasi/plan/Operation.java 
b/src/java/org/apache/cassandra/index/sasi/plan/Operation.java
new file mode 100644
index 000..1857c56
--- /dev/null
+++ b/src/java/org/apache/cassandra/index/sasi/plan/Operation.java
@@ -0,0 +1,477 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.plan;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.*;
+
+import org.apache.cassandra.config.ColumnDefinition;
+import org.apache.cassandra.config.ColumnDefinition.Kind;
+import org.apache.cassandra.cql3.Operator;
+import org.apache.cassandra.db.filter.RowFilter;
+import org.apache.cassandra.db.rows.Row;
+import org.apache.cassandra.db.rows.Unfiltered;
+import org.apache.cassandra.index.sasi.conf.ColumnIndex;
+import org.apache.cassandra.index.sasi.analyzer.AbstractAnalyzer;
+import org.apache.cassandra.index.sasi.disk.Token;
+import org.apache.cassandra.index.sasi.plan.Expression.Op;
+import org.apache.cassandra.index.sasi.utils.RangeIntersectionIterator;
+import org.apache.cassandra.index.sasi.utils.RangeIterator;
+import org.apache.cassandra.index.sasi.utils.RangeUnionIterator;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.*;
+import org.apache.cassandra.utils.FBUtilities;
+
+public class Operation extends RangeIterator<Long, Token>
+{
+public enum OperationType
+{
+AND, OR;
+
+public boolean apply(boolean a, boolean b)
+{
+switch (this)
+{
+case OR:
+return a | b;
+
+case AND:
+return a & b;
+
+default:
+throw new AssertionError();
+}
+}
+}
+
+private final QueryController controller;
+
+protected final OperationType op;
+protected final ListMultimap<ColumnDefinition, Expression> expressions;
+protected final RangeIterator<Long, Token> range;
+
+protected Operation left, right;
+
+private Operation(OperationType operation,
+                  QueryController controller,
+                  ListMultimap<ColumnDefinition, Expression> expressions,
+                  RangeIterator<Long, Token> range,
+                  Operation left, Operation right)
+{
+super(range);
+
+this.op = operation;
+this.controller = controller;
+this.expressions = expressions;
+this.range = range;
+
+this.left = left;
+this.right = right;
+}
+
+/**
+ * Recursive "satisfies" checks based on operation
+ * and data from the lower level members using depth-first search
+ * and bubbling the results back to the top level caller.
+ *
+ * Most of the work here is done by {@link #localSatisfiedBy(Unfiltered, boolean)},
+ * see its comment for details; if there are no local expressions
+ * assigned to an Operation it will call satisfiedBy(Row) on its children.
+ *
+ * Query: first_name = X AND (last_name = Y OR address = XYZ AND street = IL AND city = C) OR (state = 'CA' AND country = 'US')
+ * Row: key1: (first_name: X, last_name: Z, address: XYZ, street: IL, city: C, state: NY, country: US)
+ *
+ * #1                      OR
+ *                        /  \
+ * #2      (first_name) AND   AND (state, country)
+ *                         \
+ * #3          (last_name) OR
+ *                           \
+ * #4                        AND (address, street, city)
+ *
+ *
+ * Evaluation of key1 is a top-down depth-first search:
+ *
+ * --- going down ---
+ * Level #1 is evaluated; the OR expression has to pull results from its children, which are at level #2, and OR them together,
+ * Level #2 AND (state, country) can be evaluated right away, AND (first_name) refers to its "right" child from level #3,
+ * Level #3 OR (last_name) requests results 
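The depth-first AND/OR combine described in this comment can be modeled by a tiny expression tree. The sketch below is standalone and hypothetical (a `Node` over string-valued rows, not this patch's `Operation`/`Expression` classes); it reuses the same `OperationType#apply` semantics shown earlier in the file.

```java
import java.util.Map;
import java.util.function.Predicate;

public class OpTreeSketch
{
    // Same semantics as Operation.OperationType above, re-declared so the
    // sketch compiles on its own.
    enum OperationType
    {
        AND, OR;
        boolean apply(boolean a, boolean b) { return this == OR ? a | b : a & b; }
    }

    static class Node
    {
        final OperationType op;
        final Node left, right;
        final Predicate<Map<String, String>> local; // leaf expression, or null

        Node(OperationType op, Node left, Node right)
        {
            this.op = op; this.left = left; this.right = right; this.local = null;
        }

        Node(Predicate<Map<String, String>> local)
        {
            this.op = null; this.left = null; this.right = null; this.local = local;
        }

        // Depth-first: leaves evaluate locally, inner nodes combine children.
        boolean satisfiedBy(Map<String, String> row)
        {
            if (local != null)
                return local.test(row);
            return op.apply(left.satisfiedBy(row), right.satisfiedBy(row));
        }
    }

    public static void main(String[] args)
    {
        // first_name = X AND (state = CA OR country = US)
        Node tree = new Node(OperationType.AND,
                             new Node(r -> "X".equals(r.get("first_name"))),
                             new Node(OperationType.OR,
                                      new Node(r -> "CA".equals(r.get("state"))),
                                      new Node(r -> "US".equals(r.get("country")))));

        Map<String, String> row = Map.of("first_name", "X", "state", "NY", "country", "US");
        if (!tree.satisfiedBy(row))
            throw new AssertionError();
        System.out.println(tree.satisfiedBy(row)); // true
    }
}
```

The real Operation adds the wrinkle the comment is building toward: expressions local to a node are grouped and checked together, and only the remainder is delegated to children.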

[08/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java
--
diff --git 
a/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java 
b/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java
new file mode 100644
index 000..0d2214a
--- /dev/null
+++ 
b/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.utils;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+import java.util.PriorityQueue;
+
+import com.google.common.collect.Iterators;
+import org.apache.cassandra.io.util.FileUtils;
+
+import com.google.common.annotations.VisibleForTesting;
+
+public class RangeIntersectionIterator
+{
+protected enum Strategy
+{
+BOUNCE, LOOKUP, ADAPTIVE
+}
+
+public static <K extends Comparable<K>, D extends CombinedValue<K>> Builder<K, D> builder()
+{
+return builder(Strategy.ADAPTIVE);
+}
+
+@VisibleForTesting
+protected static <K extends Comparable<K>, D extends CombinedValue<K>> Builder<K, D> builder(Strategy strategy)
+{
+return new Builder<>(strategy);
+}
+
+public static class Builder<K extends Comparable<K>, D extends CombinedValue<K>> extends RangeIterator.Builder<K, D>
+{
+private final Strategy strategy;
+
+public Builder(Strategy strategy)
+{
+super(IteratorType.INTERSECTION);
+this.strategy = strategy;
+}
+
+protected RangeIterator<K, D> buildIterator()
+{
+// if the range is disjoint we can simply return empty
+// iterator of any type, because it's not going to produce any results.
+if (statistics.isDisjoint())
+return new BounceIntersectionIterator<>(statistics, new PriorityQueue<RangeIterator<K, D>>(1));
+
+switch (strategy)
+{
+case LOOKUP:
+return new LookupIntersectionIterator<>(statistics, ranges);
+
+case BOUNCE:
+return new BounceIntersectionIterator<>(statistics, ranges);
+
+case ADAPTIVE:
+return statistics.sizeRatio() <= 0.01d
+? new LookupIntersectionIterator<>(statistics, ranges)
+: new BounceIntersectionIterator<>(statistics, ranges);
+
+default:
+throw new IllegalStateException("Unknown strategy: " + strategy);
+}
+}
+}
+
+private static abstract class AbstractIntersectionIterator<K extends Comparable<K>, D extends CombinedValue<K>> extends RangeIterator<K, D>
+{
+protected final PriorityQueue<RangeIterator<K, D>> ranges;
+
+private AbstractIntersectionIterator(Builder.Statistics<K, D> statistics, PriorityQueue<RangeIterator<K, D>> ranges)
+{
+super(statistics);
+this.ranges = ranges;
+}
+
+public void close() throws IOException
+{
+for (RangeIterator<K, D> range : ranges)
+FileUtils.closeQuietly(range);
+}
+}
+
+/**
+ * Iterator which performs intersection of multiple ranges by using 
bouncing (merge-join) technique to identify
+ * common elements in the given ranges. Aforementioned "bounce" works as 
follows: range queue is poll'ed for the
+ * range with the smallest current token (main loop), that token is used 
to {@link RangeIterator#skipTo(Comparable)}
+ * other ranges, if token produced by {@link 
RangeIterator#skipTo(Comparable)} is equal to current "candidate" token,
+ * both get merged together and the same operation is repeated for next 
range from the queue, if returned token
+ * is not equal than candidate, candidate's range gets put back into the 
queue and the main loop gets repeated until
+ * next intersection token is found or at least one iterator runs out of 
tokens.
+ *
+ * This technique is every efficient to jump over gaps in 
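The bounce described above can be modeled over plain sorted arrays instead of `RangeIterator`s: each cursor is advanced (`skipTo`) to the current candidate, and a token survives only when every cursor lands on it. This is a simplified sketch of the merge-join idea, not the patch's queue-based implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class BounceIntersectionSketch
{
    // Intersects any number of sorted, duplicate-free token streams.
    static List<Long> intersect(long[]... ranges)
    {
        List<Long> result = new ArrayList<>();
        int[] pos = new int[ranges.length];

        outer:
        while (true)
        {
            // Candidate = the largest token currently under any cursor.
            long candidate = Long.MIN_VALUE;
            for (int i = 0; i < ranges.length; i++)
            {
                if (pos[i] >= ranges[i].length)
                    break outer; // one iterator ran out of tokens
                candidate = Math.max(candidate, ranges[i][pos[i]]);
            }

            boolean all = true;
            for (int i = 0; i < ranges.length; i++)
            {
                // skipTo(candidate): jump over the gap in this range
                while (pos[i] < ranges[i].length && ranges[i][pos[i]] < candidate)
                    pos[i]++;
                if (pos[i] >= ranges[i].length)
                    break outer;
                if (ranges[i][pos[i]] != candidate)
                    all = false; // candidate bounced; retry with new maximum
            }

            if (all)
            {
                result.add(candidate);
                for (int i = 0; i < ranges.length; i++)
                    pos[i]++;
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<Long> common = intersect(new long[]{1, 3, 5, 7, 9},
                                      new long[]{3, 4, 5, 9},
                                      new long[]{2, 3, 9, 10});
        if (!common.toString().equals("[3, 9]"))
            throw new AssertionError(common.toString());
        System.out.println(common); // [3, 9]
    }
}
```

When one range is much smaller than the rest (the `sizeRatio() <= 0.01` case in the builder above), probing each of its tokens directly against the others (the LOOKUP strategy) beats bouncing.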

[10/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java
--
diff --git a/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java 
b/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java
new file mode 100644
index 000..5d85d00
--- /dev/null
+++ b/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java
@@ -0,0 +1,519 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.disk;
+
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.cassandra.db.DecoratedKey;
+import org.apache.cassandra.index.sasi.utils.AbstractIterator;
+import org.apache.cassandra.index.sasi.utils.CombinedValue;
+import org.apache.cassandra.index.sasi.utils.MappedBuffer;
+import org.apache.cassandra.index.sasi.utils.RangeIterator;
+import org.apache.cassandra.utils.MergeIterator;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Function;
+import com.google.common.collect.Iterators;
+import org.apache.commons.lang3.builder.HashCodeBuilder;
+
+import static org.apache.cassandra.index.sasi.disk.TokenTreeBuilder.EntryType;
+
+// Note: all of the seek-able offsets contained in TokenTree should be sizeof(long)
+// even if currently only the lower int portion of them is used, because that makes
+// it possible to switch to an mmap implementation which supports long positions
+// without any on-disk format changes and/or re-indexing if one day we have a need to.
+public class TokenTree
+{
+private static final int LONG_BYTES = Long.SIZE / 8;
+private static final int SHORT_BYTES = Short.SIZE / 8;
+
+private final Descriptor descriptor;
+private final MappedBuffer file;
+private final long startPos;
+private final long treeMinToken;
+private final long treeMaxToken;
+private final long tokenCount;
+
+@VisibleForTesting
+protected TokenTree(MappedBuffer tokenTree)
+{
+this(Descriptor.CURRENT, tokenTree);
+}
+
+public TokenTree(Descriptor d, MappedBuffer tokenTree)
+{
+descriptor = d;
+file = tokenTree;
+startPos = file.position();
+
+file.position(startPos + TokenTreeBuilder.SHARED_HEADER_BYTES);
+
+if (!validateMagic())
+throw new IllegalArgumentException("invalid token tree");
+
+tokenCount = file.getLong();
+treeMinToken = file.getLong();
+treeMaxToken = file.getLong();
+}
+
+public long getCount()
+{
+return tokenCount;
+}
+
+public RangeIterator<Long, Token> iterator(Function<Long, DecoratedKey> keyFetcher)
+{
+return new TokenTreeIterator(file.duplicate(), keyFetcher);
+}
+
+public OnDiskToken get(final long searchToken, Function<Long, DecoratedKey> keyFetcher)
+{
+seekToLeaf(searchToken, file);
+long leafStart = file.position();
+short leafSize = file.getShort(leafStart + 1); // skip the info byte
+
+file.position(leafStart + TokenTreeBuilder.BLOCK_HEADER_BYTES); // skip to tokens
+short tokenIndex = searchLeaf(searchToken, leafSize);
+
+file.position(leafStart + TokenTreeBuilder.BLOCK_HEADER_BYTES);
+
+OnDiskToken token = OnDiskToken.getTokenAt(file, tokenIndex, leafSize, keyFetcher);
+return token.get().equals(searchToken) ? token : null;
+}
+
+private boolean validateMagic()
+{
+switch (descriptor.version.toString())
+{
+case Descriptor.VERSION_AA:
+return true;
+case Descriptor.VERSION_AB:
+return TokenTreeBuilder.AB_MAGIC == file.getShort();
+default:
+return false;
+}
+}
+
+// finds leaf that *could* contain token
+private void seekToLeaf(long token, MappedBuffer file)
+{
+// this loop always seeks forward except for the first iteration
+// where it may seek back to the root
+long blockStart = startPos;
+while (true)
+{
+file.position(blockStart);
+
+byte info = file.get();
+  
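The constructor above validates a magic value and then reads three fixed-width longs (token count, min token, max token) from the tree's header. That fixed-width read can be modeled with a plain `ByteBuffer`; the layout below is a miniature illustration, not SASI's actual on-disk format.

```java
import java.nio.ByteBuffer;

public class TokenTreeHeaderSketch
{
    // Writes and re-reads a miniature fixed-width header: count, min, max --
    // mirroring how TokenTree pulls three longs after validating its magic.
    public static void main(String[] args)
    {
        ByteBuffer buf = ByteBuffer.allocate(3 * Long.BYTES);
        buf.putLong(42L).putLong(-9L).putLong(1000L).flip();

        long tokenCount = buf.getLong();
        long treeMinToken = buf.getLong();
        long treeMaxToken = buf.getLong();

        if (tokenCount != 42L || treeMinToken != -9L || treeMaxToken != 1000L)
            throw new AssertionError();
        System.out.println(tokenCount + " " + treeMinToken + " " + treeMaxToken); // 42 -9 1000
    }
}
```

Keeping every seekable field a fixed 8 bytes is what makes the comment's promise hold: an mmap-based reader with long positions can consume the same bytes without a format change.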

[01/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
Repository: cassandra
Updated Branches:
  refs/heads/trunk 11c8ca6b5 -> 72790dc8e


http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/unit/org/apache/cassandra/index/sasi/plan/OperationTest.java
--
diff --git a/test/unit/org/apache/cassandra/index/sasi/plan/OperationTest.java 
b/test/unit/org/apache/cassandra/index/sasi/plan/OperationTest.java
new file mode 100644
index 000..92fbf69
--- /dev/null
+++ b/test/unit/org/apache/cassandra/index/sasi/plan/OperationTest.java
@@ -0,0 +1,645 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi.plan;
+
+import java.nio.ByteBuffer;
+import java.util.*;
+import java.util.concurrent.TimeUnit;
+
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Multimap;
+import com.google.common.collect.Sets;
+import org.apache.cassandra.SchemaLoader;
+import org.apache.cassandra.config.CFMetaData;
+import org.apache.cassandra.config.ColumnDefinition;
+import org.apache.cassandra.cql3.Operator;
+import org.apache.cassandra.db.*;
+import org.apache.cassandra.db.filter.RowFilter;
+import org.apache.cassandra.db.marshal.DoubleType;
+import org.apache.cassandra.db.rows.*;
+import org.apache.cassandra.index.sasi.plan.Operation.OperationType;
+import org.apache.cassandra.db.marshal.Int32Type;
+import org.apache.cassandra.db.marshal.LongType;
+import org.apache.cassandra.db.marshal.UTF8Type;
+import org.apache.cassandra.exceptions.ConfigurationException;
+import org.apache.cassandra.schema.KeyspaceMetadata;
+import org.apache.cassandra.schema.KeyspaceParams;
+import org.apache.cassandra.schema.Tables;
+import org.apache.cassandra.service.MigrationManager;
+import org.apache.cassandra.utils.FBUtilities;
+
+import org.junit.*;
+
+public class OperationTest extends SchemaLoader
+{
+private static final String KS_NAME = "sasi";
+private static final String CF_NAME = "test_cf";
+private static final String CLUSTERING_CF_NAME = "clustering_test_cf";
+
+private static ColumnFamilyStore BACKEND;
+private static ColumnFamilyStore CLUSTERING_BACKEND;
+
+@BeforeClass
+public static void loadSchema() throws ConfigurationException
+{
+System.setProperty("cassandra.config", "cassandra-murmur.yaml");
+SchemaLoader.loadSchema();
+MigrationManager.announceNewKeyspace(KeyspaceMetadata.create(KS_NAME,
+KeyspaceParams.simpleTransient(1),
+Tables.of(SchemaLoader.sasiCFMD(KS_NAME, CF_NAME),
+SchemaLoader.clusteringSASICFMD(KS_NAME, CLUSTERING_CF_NAME))));
+
+BACKEND = Keyspace.open(KS_NAME).getColumnFamilyStore(CF_NAME);
+CLUSTERING_BACKEND = Keyspace.open(KS_NAME).getColumnFamilyStore(CLUSTERING_CF_NAME);
+}
+
+private QueryController controller;
+
+@Before
+public void beforeTest()
+{
+controller = new QueryController(BACKEND,
+PartitionRangeReadCommand.allDataRead(BACKEND.metadata, FBUtilities.nowInSeconds()),
+TimeUnit.SECONDS.toMillis(10));
+}
+
+@After
+public void afterTest()
+{
+controller.finish();
+}
+
+@Test
+public void testAnalyze() throws Exception
+{
+final ColumnDefinition firstName = getColumn(UTF8Type.instance.decompose("first_name"));
+final ColumnDefinition age = getColumn(UTF8Type.instance.decompose("age"));
+final ColumnDefinition comment = getColumn(UTF8Type.instance.decompose("comment"));
+
+// age != 5 AND age > 1 AND age != 6 AND age <= 10
+Map<Expression.Op, Expression> expressions = convert(Operation.analyzeGroup(controller, OperationType.AND,
+Arrays.asList(new SimpleExpression(age, Operator.NEQ, Int32Type.instance.decompose(5)),
+

[03/14] cassandra git commit: Integrate SASI index into Cassandra

2016-01-23 Thread xedin
http://git-wip-us.apache.org/repos/asf/cassandra/blob/72790dc8/test/unit/org/apache/cassandra/index/sasi/SASIIndexTest.java
--
diff --git a/test/unit/org/apache/cassandra/index/sasi/SASIIndexTest.java 
b/test/unit/org/apache/cassandra/index/sasi/SASIIndexTest.java
new file mode 100644
index 000..cb5ec73
--- /dev/null
+++ b/test/unit/org/apache/cassandra/index/sasi/SASIIndexTest.java
@@ -0,0 +1,1852 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.index.sasi;
+
+import java.nio.ByteBuffer;
+import java.util.*;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicInteger;
+
+import org.apache.cassandra.SchemaLoader;
+import org.apache.cassandra.config.CFMetaData;
+import org.apache.cassandra.config.ColumnDefinition;
+import org.apache.cassandra.config.DatabaseDescriptor;
+import org.apache.cassandra.cql3.*;
+import org.apache.cassandra.cql3.Term;
+import org.apache.cassandra.cql3.statements.IndexTarget;
+import org.apache.cassandra.cql3.statements.SelectStatement;
+import org.apache.cassandra.db.*;
+import org.apache.cassandra.db.filter.ColumnFilter;
+import org.apache.cassandra.db.filter.DataLimits;
+import org.apache.cassandra.db.filter.RowFilter;
+import org.apache.cassandra.db.marshal.*;
+import org.apache.cassandra.db.partitions.PartitionUpdate;
+import org.apache.cassandra.db.partitions.UnfilteredPartitionIterator;
+import org.apache.cassandra.db.rows.*;
+import org.apache.cassandra.dht.IPartitioner;
+import org.apache.cassandra.dht.Murmur3Partitioner;
+import org.apache.cassandra.dht.Range;
+import org.apache.cassandra.exceptions.ConfigurationException;
+import org.apache.cassandra.index.sasi.conf.ColumnIndex;
+import org.apache.cassandra.index.sasi.disk.OnDiskIndexBuilder;
+import org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException;
+import org.apache.cassandra.index.sasi.plan.QueryPlan;
+import org.apache.cassandra.schema.IndexMetadata;
+import org.apache.cassandra.schema.KeyspaceMetadata;
+import org.apache.cassandra.schema.KeyspaceParams;
+import org.apache.cassandra.schema.Tables;
+import org.apache.cassandra.serializers.MarshalException;
+import org.apache.cassandra.serializers.TypeSerializer;
+import org.apache.cassandra.service.MigrationManager;
+import org.apache.cassandra.service.QueryState;
+import org.apache.cassandra.thrift.CqlRow;
+import org.apache.cassandra.transport.messages.ResultMessage;
+import org.apache.cassandra.utils.ByteBufferUtil;
+import org.apache.cassandra.utils.FBUtilities;
+import org.apache.cassandra.utils.Pair;
+
+import com.google.common.collect.Lists;
+import com.google.common.util.concurrent.Uninterruptibles;
+
+import junit.framework.Assert;
+
+import org.junit.*;
+
+public class SASIIndexTest
+{
+private static final IPartitioner PARTITIONER = new Murmur3Partitioner();
+
+private static final String KS_NAME = "sasi";
+private static final String CF_NAME = "test_cf";
+private static final String CLUSTRING_CF_NAME = "clustering_test_cf";
+
+@BeforeClass
+public static void loadSchema() throws ConfigurationException
+{
+System.setProperty("cassandra.config", "cassandra-murmur.yaml");
+SchemaLoader.loadSchema();
+MigrationManager.announceNewKeyspace(KeyspaceMetadata.create(KS_NAME,
+KeyspaceParams.simpleTransient(1),
+Tables.of(SchemaLoader.sasiCFMD(KS_NAME, CF_NAME),
+SchemaLoader.clusteringSASICFMD(KS_NAME, CLUSTRING_CF_NAME))));
+}
+
+@After
+public void cleanUp()
+{
+Keyspace.open(KS_NAME).getColumnFamilyStore(CF_NAME).truncateBlocking();
+}
+
+@Test
+public void testSingleExpressionQueries() throws Exception
+{
+testSingleExpressionQueries(false);
+cleanupData();
+