[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-02-10 Thread Aleksey Yeschenko (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140749#comment-15140749 ]

Aleksey Yeschenko commented on CASSANDRA-8180:
--

Committed as [b11fba750c610de5e97acba070cc571cf0a96416|https://github.com/apache/cassandra/commit/b11fba750c610de5e97acba070cc571cf0a96416] to trunk (3.4), thanks.

> Optimize disk seek using min/max column name meta data when the LIMIT clause 
> is used
> 
>
> Key: CASSANDRA-8180
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
> Environment: Cassandra 2.0.10
>Reporter: DOAN DuyHai
>Assignee: Stefania
>Priority: Minor
> Fix For: 3.4
>
> Attachments: 8180_001.yaml, 8180_002.yaml
>
>
> I was working on an example of a sensor data table (time series) and faced a use
> case where C* does not optimize reads on disk.
> {code}
> cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) WITH CLUSTERING ORDER BY (col DESC);
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
> ...
> >nodetool flush test test
> {code}
> After that, I activated request tracing:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
>  activity | timestamp | source | source_elapsed
> ----------+--------------+-----------+----------------
>  execute_cql3_query | 23:48:46,498 | 127.0.0.1 | 0
>  Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 23:48:46,498 | 127.0.0.1 | 74
>  Preparing statement | 23:48:46,499 | 127.0.0.1 | 253
>  Executing single-partition query on test | 23:48:46,499 | 127.0.0.1 | 930
>  Acquiring sstable references | 23:48:46,499 | 127.0.0.1 | 943
>  Merging memtable tombstones | 23:48:46,499 | 127.0.0.1 | 1032
>  Key cache hit for sstable 3 | 23:48:46,500 | 127.0.0.1 | 1160
>  Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 | 1173
>  Key cache hit for sstable 2 | 23:48:46,500 | 127.0.0.1 | 1889
>  Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 | 1901
>  Key cache hit for sstable 1 | 23:48:46,501 | 127.0.0.1 | 2373
>  Seeking to partition beginning in data file | 23:48:46,501 | 127.0.0.1 | 2384
>  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:48:46,501 | 127.0.0.1 | 2768
>  Merging data from memtables and 3 sstables | 23:48:46,501 | 127.0.0.1 | 2784
>  Read 2 live and 0 tombstoned cells | 23:48:46,501 | 127.0.0.1 | 2976
>  Request complete | 23:48:46,501 | 127.0.0.1 | 3551
> {code}
> We can clearly see that C* hits 3 SSTables on disk instead of just one, 
> although it has the min/max column meta data to decide which SSTable contains 
> the most recent data.
> Funny enough, if we add a clause on the clustering column to the select, this 
> time C* optimizes the read path:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
>  activity | timestamp | source | source_elapsed
> ----------+--------------+-----------+----------------
>  execute_cql3_query | 23:52:31,888 | 127.0.0.1 | 0
>  Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | 23:52:31,888 | 127.0.0.1 | 60
>  Preparing statement | 23:52:31,888 | 127.0.0.1 | 277

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-02-09 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140087#comment-15140087 ]

Stefania commented on CASSANDRA-8180:
-

CI looks good; this is ready to commit.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-02-08 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138226#comment-15138226 ]

Stefania commented on CASSANDRA-8180:
-

Thanks for the review, Branimir!

The patch for trunk is [here|https://github.com/stef1927/cassandra/commits/8180], rebased and squashed.

I've launched one more CI round; if all is good I will move the ticket to Ready to Commit:

||trunk||
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-dtest/]|



[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-02-08 Thread Branimir Lambov (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136987#comment-15136987 ]

Branimir Lambov commented on CASSANDRA-8180:


The changes look good and the patch looks good.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-31 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125613#comment-15125613 ]

Stefania commented on CASSANDRA-8180:
-

Thank you for your comments; please refer to [this commit|https://github.com/stef1927/cassandra/commit/a180f02ebf7679fac76a27cf7be752f310d343ff]:

* Re-introduced a single list of iterators; we now filter by iterator type before counting the sstables iterated.
* Overrode {{UnfilteredRowIteratorWithLowerBound.staticRow}}, and yes, the problem was the single-source case.
* Replaced the inline {{Transformation}} instance with a named class.
* Modified the existing tests for static columns to also test when flushing to sstables.
* Fixed some trivial warnings and removed some duplication in test code ({{flush(boolean forceFlush)}}).

CI is still running. I will also need to squash before we can flag this as ready to commit.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-29 Thread Benedict (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123535#comment-15123535 ]

Benedict commented on CASSANDRA-8180:
-

One other nit: for debugging it's much clearer if we create named instances of 
{{Transformation}} - these can be declared inline in the method, so it's only 
one extra line of code.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-29 Thread Branimir Lambov (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123275#comment-15123275 ]

Branimir Lambov commented on CASSANDRA-8180:


LGTM.

Nits (feel free to ignore):

You can use {{count}} instead of {{mapToInt(e -> 1).sum()}}.
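
For illustration only, here is a tiny standalone Java snippet contrasting the two forms (this is not code from the patch; the class and variable names are made up):

{code}
import java.util.Arrays;
import java.util.List;

public class CountVsSum
{
    public static void main(String[] args)
    {
        List<String> sstables = Arrays.asList("big-ma-1", "big-ma-2", "big-ma-3");

        // Same result, but count() states the intent directly.
        int viaSum = sstables.stream().mapToInt(e -> 1).sum();
        long viaCount = sstables.stream().count();

        System.out.println(viaSum + " " + viaCount); // 3 3
    }
}
{code}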

The {{ListUtils}} terminology is strange: {{union}} is the concatenation and {{sum}} is the union? It's also not very efficient. I don't think the two separate lists make a lot of sense versus using a single list with an {{instanceof}} check in {{onPartitionClose}}.

bq. BaseRows eagerly caches the static row in its constructor

Isn't {{mergeIterator.mergeStaticRows}} the one that does this? Or is the 
problem the single-source case? In any case, you could override 
{{UnfilteredRowIteratorWithLowerBound.staticRow}} to return empty if no static 
columns are required.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-28 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121128#comment-15121128 ]

Stefania commented on CASSANDRA-8180:
-

The latest commit is available [here|https://github.com/stef1927/cassandra/commit/a0e779d5970b764f2732c387067218407617b4ab].

bq. I am worried about the partitionLevelDeletion > minTimestamp check for 
tables with tombstones, though. It seems that the code is still willing to skip 
a table with tombstones if it doesn't delete the whole partition. Could that be 
right?

You are right to be worried; this test fails on trunk:

{code}
createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))");

for (int i = 0; i < 10; i++)
    execute("INSERT INTO %s (a, b, c) VALUES(1, ?, 'abc')", i);
flush();

execute("DELETE FROM %s WHERE a=1 and b <= 3");
flush();

assertEmpty(execute("SELECT * FROM %s WHERE a=1 and b <= 2"));
{code}

I git-annotated the original code in 2.2, which is in {{CollationController}} at line 293, and traced it to CASSANDRA-5514. I think the comment explaining it is [this one|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13645632&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13645632]. However, perhaps back then range tombstones did not exist? BTW, [this comment|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13656934&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13656934] explains quite well the _raison d'ĂȘtre_ of the min and max clustering values in the stats.

bq. Shouldn't minTimestamp be updated only if shouldInclude()?

I agree, and now that CASSANDRA-9949 is available we should also update {{minTimestamp}} by looking at memtables as well.

bq. The trace on merging doesn't seem to be using the right value.

Done, thanks.

bq. You should use Transformation for withSSTablesIterated, and I'd actually do 
isEmpty() explicitly-- the new code is harder to read but I don't think it 
improves anything.

Done. Because {{BaseRows}} eagerly caches the static row in its constructor, I had to change one thing: rather than counting an sstable as iterated when the iterator is initialized, we do so when {{computeNext()}} is called. I think this is fair, since the constructor only accesses the data partition in very limited cases, as discussed above. However, we do access the partition index, which prior to this change we would only do for sstables with tombstones. If this is not acceptable, then either we don't use {{Transformation}} or we change {{BaseRows}}.

I've also attached a boolean to the iterator itself rather than counting the tables via {{sstablesIterated}}, and I was therefore able to get rid of {{QueryMemtableAndDiskStatus}}. It seems a bit cleaner like this. I called the method returning that boolean {{iterated()}}, but perhaps we can find a better name.
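
As a generic illustration of deferring the "iterated" flag to the first actual read (plain Java; this is not the actual {{UnfilteredRowIteratorWithLowerBound}} or {{Transformation}} code, and the names {{LazyIteratedFlag}} and {{iterated()}} are used here only to mirror the idea):

{code}
import java.util.Arrays;
import java.util.Iterator;

// Illustrative only: wraps an iterator and records whether the underlying source was actually
// consumed, mirroring the idea of flagging an sstable as "iterated" on first use rather than
// at construction time.
class LazyIteratedFlag<T> implements Iterator<T>
{
    private final Iterator<T> delegate;
    private boolean iterated = false;

    LazyIteratedFlag(Iterator<T> delegate)
    {
        this.delegate = delegate;
    }

    public boolean iterated()
    {
        return iterated;
    }

    public boolean hasNext()
    {
        return delegate.hasNext();
    }

    public T next()
    {
        iterated = true; // flag only when data is really read
        return delegate.next();
    }

    public static void main(String[] args)
    {
        LazyIteratedFlag<Integer> it = new LazyIteratedFlag<>(Arrays.asList(1, 2, 3).iterator());
        System.out.println(it.iterated()); // false: nothing consumed yet
        it.next();
        System.out.println(it.iterated()); // true
    }
}
{code}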

bq. CQLTester.cfs() is redundant (use getCurrentColumnFamilyStore()).

Done, thanks.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-27 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118892#comment-15118892 ]

Stefania commented on CASSANDRA-8180:
-

The latest commit is ready for review [here|https://github.com/stef1927/cassandra/commit/fb1a4d93dc101143eb3d4062d577e643e8824c73]. CI is still pending, however.

bq. Isn't the static row outside the iteration?

That's a very good point, thanks for pointing this out. So we don't need to 
worry about this.

bq. So we can't really trust hasTombstones()? This may be a significant bug.

This method was introduced in this patch, but the same logic is currently used in {{SinglePartitionReadCommand.queryMemtableAndDiskInternal()}} and {{queryMemtableAndSSTablesInTimestampOrder}} to determine which non-overlapping sstables contain no tombstones at all; see especially the comment in the second method:

{code}
// This means that nothing queried by the filter can be in the sstable. One exception is the
// top-level partition deletion however: if it is set, it impacts everything and must be included.
// Getting that top-level partition deletion costs us some seek in general however (unless the
// partition is indexed and is in the key cache), so we first check if the sstable has any
// tombstone at all as a shortcut.
if (sstable.getSSTableMetadata().maxLocalDeletionTime == Integer.MAX_VALUE)
    continue; // Means no tombstone at all, we can skip that sstable
{code}

The trouble is that in other places we assume the opposite: for example, {{CompactionController.getFullyExpiredSSTables()}} requires {{maxDeletionTime}} to be {{Integer.MAX_VALUE}} in the presence of live cells, or else it might expire the sstable. I think the code in {{SinglePartitionReadCommand}} is wrong and we should have looked at the {{minDeletionTime}} instead, so I've updated the new method accordingly.
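
To make the min-versus-max distinction concrete, a small illustrative sketch (not the real {{StatsMetadata}} API; it only encodes the convention discussed above that live cells report a local deletion time of {{Integer.MAX_VALUE}} while tombstones report a finite one):

{code}
public final class TombstoneCheckSketch
{
    // Wrong shortcut: max == MAX_VALUE only proves that at least one live cell exists.
    static boolean hasNoTombstonesViaMax(int maxLocalDeletionTime)
    {
        return maxLocalDeletionTime == Integer.MAX_VALUE;
    }

    // What the comment argues for: if even the minimum is MAX_VALUE, no cell is a tombstone.
    static boolean hasNoTombstonesViaMin(int minLocalDeletionTime)
    {
        return minLocalDeletionTime == Integer.MAX_VALUE;
    }

    public static void main(String[] args)
    {
        // An sstable with one live cell (MAX_VALUE) and one tombstone (some finite deletion time):
        int min = 1453680000;
        int max = Integer.MAX_VALUE;
        System.out.println(hasNoTombstonesViaMax(max)); // true  -> sstable would be wrongly skipped
        System.out.println(hasNoTombstonesViaMin(min)); // false -> tombstone correctly detected
    }
}
{code}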

I've had some failures with the upgrade dtests, due to the clustering values of old ("jb" format) sstables containing more values than the comparator has types. I've added a check on {{Version.hasNewStatsFile()}} to prevent this, plus some assertions to spot it sooner, and this seems to have fixed the failing tests.

To wrap up the discussion on upgrading the sstable stats: doing so would allow applying this optimization also to sstables that have tombstones and are smaller than {{DatabaseDescriptor.column_index_size_in_kb}} (64kb by default). For larger sstables with tombstones, the lower bound comes from the partition index, and this will be read anyway in the presence of tombstones when {{partitionDeletion()}} is called. This also requires initializing the iterator, which will cause a disk seek to the partition start if the partition index is not available (sstables smaller than 64kb) or there are static columns in the query; see the {{AbstractSSTableIterator}} constructor.

Therefore I don't see the point in upgrading the stats, but should we wish to do it, here is the work required:
- upgrade the sstable version from "ma" to "mb" (unsure what this entails exactly, other than changing {{BigVersion}} and {{BigFormat}})
- introduce a serializer for clustering prefixes that is able to distinguish between clusterings and bounds, using functionality similar to what's in {{UnfilteredSerializer}}
- pass the clustering types, or ideally the {{SerializationHeader}}, to the {{StatsMetadata}} serialize and deserialize methods; or alternatively add the min and max clustering prefixes to {{SerializationHeader.Component}}, which is also a metadata component. However, in this case we'd have to do some more work to pass it the min and max clustering prefixes, which can instead be easily collected for {{StatsMetadata}} using the existing method {{MetadataCollector.updateClusteringValues()}}. I suppose they could also go into {{EncodingStats}}, which is serialized in the header, but I am not sure whether this would increase the size of partitions excessively, because it seems we also use {{EncodingStats}} for partitions.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-27 Thread Branimir Lambov (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119159#comment-15119159 ]

Branimir Lambov commented on CASSANDRA-8180:


Thank you, this now looks almost ready to go. This is not a problem of the new code, but I am worried about the [{{partitionLevelDeletion > minTimestamp}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R587] check for tables with tombstones. It seems that the code is still willing to skip a table with tombstones if it doesn't delete the whole partition. Could that be right?

Nits:

Shouldn't 
[{{minTimestamp}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R549]
 be updated only if {{shouldInclude()}}?

[The trace on 
merging|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R607]
 doesn't seem to be using the right value.

You should use {{Transformation}} for [{{withSSTablesIterated}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R664], and I'd actually do {{isEmpty()}} explicitly; the new code is harder to read but I don't think it improves anything.

[{{CQLTester.cfs()}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-8693101ca0a55a372ef784230968648aR397]
 is redundant (use {{getCurrentColumnFamilyStore()}}).



[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-26 Thread Branimir Lambov (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117151#comment-15117151 ]

Branimir Lambov commented on CASSANDRA-8180:


bq. Further, it seems that static columns could also have problems with 
clustering values so it's not just tombstones. See the comment in 
{{SinglePartitionReadCommand.shouldInclude()}}.

Isn't the static row outside the iteration? Unless I'm very mistaken, it shouldn't enter {{MergeIterator}} at all, and thus the bound needn't be corrected for it.

Additionally, if static rows are needed, the merge will query each iterator's {{staticRow()}} method, which will initialize the iterator.

bq. It is enabled when there are no tombstones since the iterator is null.

Sorry, my bad, I had overlooked that {{partitionLevelDeletion()}} does not call the super method in that case.

bq. It's once again the metadata that is wrong: maxLocalDeletionTime is 
Integer.MAX_VALUE even though there are tombstones in the sstable, as a 
consequence sstable.hasTombstones() returns an incorrect value.

So we can't really trust {{hasTombstones()}}? This may be a significant bug.




[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-25 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116655#comment-15116655 ]

Stefania commented on CASSANDRA-8180:
-

bq. That is, your code is currently completely disabled by the {{if (... iterator != null) return null;}} check in {{lowerBound()}}? Is there a reason to want the code committed then?

It is enabled when there are no tombstones, since the iterator is null. However, we can do better: when there are tombstones we can still use the row index bound if there is one.


bq. On the other hand, an assertion in {{UnfilteredIteratorWithLowerBound}} 
that saves the returned {{lowerBound()}} and checks the first {{next()}} to be 
greater-than or equal may be something we want to have anyway.

I've added this. Whilst the test you proposed is fine, existing tests (e.g. {{DeleteTest.testDeleteWithRangeAndOneClusteringColumn()}}) started to fail, but _only when compaction is enabled_. It seems we can have sstables with no tombstones and valid clustering values (e.g. 0), and yet the first item is a {{Bound.BOTTOM}}.

Further, it seems that static columns could also have problems with clustering values, so it's not just tombstones. See the comment in {{SinglePartitionReadCommand.shouldInclude()}}.

Moving back to In Progress; we might end up modifying the sstable metadata format after all.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-25 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116897#comment-15116897 ]

Stefania commented on CASSANDRA-8180:
-

I think the issue is in {{MetadataCollector}}:

{code}
public void update(Cell cell)
{
    updateTimestamp(cell.timestamp());
    updateTTL(cell.ttl());
    updateLocalDeletionTime(cell.localDeletionTime());
}
{code}

Shouldn't we check that the cell is a tombstone before updating the local 
deletion time?
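
For illustration, the suggested check would look roughly like this (a sketch of the snippet above only, not the committed fix; it assumes a {{Cell.isTombstone()}} accessor):

{code}
public void update(Cell cell)
{
    updateTimestamp(cell.timestamp());
    updateTTL(cell.ttl());
    // A live cell reports Integer.MAX_VALUE as its local deletion time, which makes
    // maxLocalDeletionTime useless for detecting tombstones; only record the value
    // when the cell actually is a tombstone. (Sketch only, assumed accessor.)
    if (cell.isTombstone())
        updateLocalDeletionTime(cell.localDeletionTime());
}
{code}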


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-25 Thread Stefania (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116781#comment-15116781 ]

Stefania commented on CASSANDRA-8180:
-

bq. It seems we can have sstables with no tombstones, valid clustering values 
(e.g. 0) and yet the first item is a Bound.BOTTOM. 

It's once again the metadata that is wrong: {{maxLocalDeletionTime}} is {{Integer.MAX_VALUE}} even though there are tombstones in the sstable; as a consequence, {{sstable.hasTombstones()}} returns an incorrect value.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-25 Thread Branimir Lambov (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115132#comment-15115132 ]

Branimir Lambov commented on CASSANDRA-8180:


bq. iter.partitionLevelDeletion() is currently called for all sstables by 
queryMemtableAndDiskInternal()

That is, your code is currently completely disabled by the [{{if (... iterator != null) return null;}}|https://github.com/stef1927/cassandra/commit/d5cfc6fd56d50eda5d9c510591bae1d66e17ec59#diff-78de604e500e1cc63c6d53a2ac6d6d65R52] check in {{lowerBound()}}? Is there a reason to want the code committed then?


bq. We can replace BTreeRow.emptyRow(ret) with new 
RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(),
 ret.getRawValues()), DeletionTime.LIVE) if there is still a valid reason and 
ideally a failing test would be useful.

It is not as easy to fully break it as I was expecting, but in the presence of tombstones you can still break a basic feature of the iterators: the inequality of the returned elements. A test that does
{code}
createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))");
execute("INSERT INTO %s (a, b, c) VALUES(1, 1, '1')");
execute("INSERT INTO %s (a, b, c) VALUES(1, 3, '3')");
execute("DELETE FROM %s where a=1 and b >= 2 and b <= 3");
execute("INSERT INTO %s (a, b, c) VALUES(1, 2, '2')");
flush();

execute("DELETE FROM %s where a=1 and b >= 2 and b <= 3");
flush();

execute("SELECT * FROM %s WHERE a = ?", 1);
{code}
will end up with an iterator that lists two tombstone markers with equal clustering. Unfortunately, that's filtered out before being returned, so it's not trivial to write a test that checks this. On the other hand, an assertion in {{UnfilteredIteratorWithLowerBound}} that saves the returned {{lowerBound()}} and checks that the first {{next()}} is greater than or equal to it may be something we want to have anyway.
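
A rough, generic illustration of such an assertion (plain Java with an explicit comparator; not the actual {{UnfilteredRowIteratorWithLowerBound}} API, and the class name is made up):

{code}
import java.util.Comparator;
import java.util.Iterator;

// Illustrative only: remembers the advertised lower bound and asserts that the first element
// actually returned by the iterator is not smaller than it (run with -ea to enable asserts).
class LowerBoundCheckingIterator<T> implements Iterator<T>
{
    private final Iterator<T> delegate;
    private final Comparator<? super T> comparator;
    private final T lowerBound;
    private boolean checkedFirst = false;

    LowerBoundCheckingIterator(Iterator<T> delegate, Comparator<? super T> comparator, T lowerBound)
    {
        this.delegate = delegate;
        this.comparator = comparator;
        this.lowerBound = lowerBound;
    }

    public boolean hasNext()
    {
        return delegate.hasNext();
    }

    public T next()
    {
        T next = delegate.next();
        if (!checkedFirst)
        {
            checkedFirst = true;
            assert lowerBound == null || comparator.compare(next, lowerBound) >= 0
                : "first element " + next + " is smaller than the advertised lower bound " + lowerBound;
        }
        return next;
    }
}
{code}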



[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-25 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114909#comment-15114909
 ] 

Stefania commented on CASSANDRA-8180:
-

Pushed a [commit|https://github.com/stef1927/cassandra/commit/d5cfc6fd56d50eda5d9c510591bae1d66e17ec59] where we don't use the lower bound in the presence of tombstones, and {{DeleteTest}} is now passing. CI is still pending, however. I would like to point out that {{iter.partitionLevelDeletion()}} is currently called for all sstables by {{queryMemtableAndDiskInternal()}}, and therefore, in the presence of tombstones, we have to access the sstable anyway.

bq. Tombstones. A DELETE WHERE pk = ? AND ck1 = ? in a table with key (pk, ck1, 
ck2) will generate one.

This case existed already in 
{{DeleteTest.testDeleteWithRangeAndTwoClusteringColumns()}}. It does not fail 
though, because the clustering comparator compares prefix values from first to 
last and so it works fine with incomplete prefixes.

bq. An empty row will not work correctly as a lower bound. It does not sort as 
needed with respect to tombstone bounds, which should also be included in the 
test (more specifically, one that adds a row, flushes, deletes same row, 
flushes again, then checks if it resurfaces-- I believe this would break with 
the current code). Use a RangeTombstoneBound with DeletionTime.LIVE as the 
deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which 
should do the right thing in both directions.

I don't think this matters any longer if we never replace a tombstone with a lower bound, but I could not reproduce any failures even when using a lower bound for tombstones (other than for one-sided range tombstones). Here is what I tried:

- add a row, flush, delete the same row, flush again => FINE (we don't even use the merge iterator)
- add two rows in the same partition, flush, delete one and flush, delete the other one and flush => FINE again
- add three rows in the same partition, flush, delete one and flush, delete the other one and flush => FINE again

We can replace {{BTreeRow.emptyRow(ret)}} with {{new RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(), ret.getRawValues()), DeletionTime.LIVE)}} if there is still a valid reason; ideally a failing test would be useful. I will clean up {{lowerBound()}} and add some comments to it in the next round, once we have reached a decision.
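
For reference, a sketch of the two alternatives being discussed -- only the two expressions come from this thread; the surrounding variable names ({{ret}}, {{filter}}) are assumed from context:

{code}
// Sketch only: 'ret' is assumed to be the clustering prefix recovered from the sstable
// metadata/index and 'filter' the current ClusteringIndexFilter.

// Current approach: an empty row positioned at the lower-bound clustering.
Unfiltered currentLowerBound = BTreeRow.emptyRow(ret);

// Discussed alternative: a live range tombstone bound marker, which sorts correctly
// with respect to other tombstone bounds at the same clustering.
Unfiltered alternativeLowerBound =
    new RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(),
                                                                     ret.getRawValues()),
                                  DeletionTime.LIVE);
{code}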


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-21 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110307#comment-15110307
 ] 

Branimir Lambov commented on CASSANDRA-8180:


bq. What is an example of any other incomplete prefix and do we have a gap in 
the tests then?

Tombstones. A {{DELETE WHERE pk = ? AND ck1 = ?}} in a table with key {{(pk, 
ck1, ck2)}} will generate one.

bq. What I don't understand is how things like shouldInclude() in 
ClusteringIndexNamesFilter or ClusteringIndexSliceFilter work.

If you look at the callsites for the method, you will see that they do more 
work in the presence of tombstones. So one solution is not to use the 
{{min/maxClusteringValues}} in that case.

bq. \[MetadataSerializer.deserialize()\] should receive the total size to work 
out if there is more stuff to read at the end.

No need for that, you can set a flag in {{Version}} to tell you whether or not 
the information is present.
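
As a sketch of that idea (the flag and helper names below are hypothetical, not existing methods):

{code}
// Hypothetical sketch: gate reading of the new metadata field on the sstable format
// version, so files written by older versions simply never expose it.
if (version.hasAccurateMinMaxClustering())       // assumed flag on Version
{
    minClusteringPrefix = readClusteringPrefix(in);   // assumed helper
    maxClusteringPrefix = readClusteringPrefix(in);   // assumed helper
}
// else: leave the prefixes null and fall back to the per-component min/max values.
{code}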

bq. I'm not sure what you mean, \[use a RangeTombstoneBound\] for the test or 
the fix?

This is the fix. Instead of an empty row, the lower bound should be a 
{{RangeTombstoneBound}} as described.

bq. The global lower bound is free, since it is available in the metadata. The 
index lower bound is more accurate but it requires seeking the index file.

In the way you use this class, by the time {{lowerBound()}} is called, all of 
this is already done (by {{UnfilteredRowMergeIterator.create}}), possibly 
unnecessarily (if {{MergeIterator.OneToOne}} is to be used). I would just move 
finding the bound to {{lowerBound()}}, and I don't think it's even necessary to 
save the bound-- just retrieve it there, the method won't be called more than 
once.



[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-21 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110332#comment-15110332
 ] 

Stefania commented on CASSANDRA-8180:
-

bq. Tombstones. A {{DELETE WHERE pk = ? AND ck1 = ?}} in a table with key 
{{(pk, ck1, ck2)}} will generate one.

I will add this test, thanks.

bq. So one solution is not to use the min/maxClusteringValues in that case.

So maybe we could simply return a null lower bound in the presence of tombstones, or is this too much of a compromise?

bq. No need for that, you can set a flag in Version to tell you whether or not 
the information is present.

Doesn't this only indicate a different sstable version ("la", "ma", etc)?

bq. This is the fix. Instead of an empty row, the lower bound should be a 
RangeTombstoneBound as described.

Thanks.

bq.  I would just move finding the bound to lowerBound(), and I don't think 
it's even necessary to save the bound

OK


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-20 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109971#comment-15109971
 ] 

Stefania commented on CASSANDRA-8180:
-

Thank you for the review.

bq. updateClusteringValues() doesn't produce what you need, as it is not just 
TOP and BOTTOM that are mishandled, any incomplete prefix will also be. What 
you need is an actual min/maxClusteringPrefix which is different from the 
min/max of each component separately.

I agree that we need a full clustering prefix. What I don't understand is how 
things like {{shouldInclude()}} in {{ClusteringIndexNamesFilter}} or 
{{ClusteringIndexSliceFilter}} work. What is an example of any other incomplete 
prefix and do we have a gap in the tests then?

bq. I don't know if it is possible to append the data you need at the end of 
the stats component and have earlier versions happily ignore that data.

There are 4 components that we write to the same file: VALIDATION, STATS, COMPACTION and HEADER, so we can't simply keep reading until the end of the file. However, we do write a TOC with the position of each component, so it should be possible. It would require changes to {{MetadataSerializer.deserialize()}} and to the signature of {{IMetadataComponentSerializer.deserialize()}}, which should receive the total size so it can work out whether there is more data to read at the end. I guess we can go for it.
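
A purely illustrative sketch of that idea, assuming the TOC gives each component's start offset; the extra {{size}} argument is the proposed change, not the current signature:

{code}
// Illustrative sketch: the size available to a component is the distance from its
// start offset to the next component's offset (or to the end of the file), so a
// serializer that receives it can detect optional trailing fields.
long size = nextComponentOffset - componentOffset;
StatsMetadata stats = statsSerializer.deserialize(version, in, size);  // 'size' is the proposed extra argument
{code}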

bq. I personally would prefer to not modify the behaviour of MergeIterator and 
keep it doing one simple thing, but this approach does have its charm.

The changes to {{MergeIterator}} are mostly in the candidate and really minimal; the actual algorithm is unchanged. However, if you do have a less invasive approach in mind, I'm eager to hear it.

bq. An empty row will not work correctly as a lower bound. It does not sort as 
needed with respect to tombstone bounds, which should also be included in the 
test (more specifically, one that adds a row, flushes, deletes same row, 
flushes again, then checks if it resurfaces-- I believe this would break with 
the current code). 

Thanks, I'll add this test.

bq. Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a 
bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right 
thing in both directions.

I'm not sure what you mean: do this for the test or the fix? I'm sure I'll work it out when I write the test, though.

bq. IMergeIterator.LowerBound is cryptic, rename it to IteratorWithLowerBound 
to be explicit about its purpose.

OK

bq. The choice to set rowIndexLowerBound in partitionLevelDeletion() appears 
very arbitrary and fragile. What is the reason to do it separately from 
globalLowerBound? In fact, why have two separate bounds instead of one, set 
from the most precise information that is available at construction time?

The global lower bound is free, since it is available in the metadata. The index lower bound is more accurate, but it requires seeking the index file. Calling {{super.partitionLevelDeletion()}} also involves initializing the iterator and accessing the data file ({{AbstractSSTableIterator}} constructor). So we decided to use the more accurate bound only when we really have to access the index anyway, that is, when {{partitionLevelDeletion()}} is called and there are tombstones. See [this comment|https://issues.apache.org/jira/browse/CASSANDRA-8180?focusedCommentId=14388301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14388301] above.
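
A minimal sketch of that decision, with assumed helper and field names (not the actual patch code):

{code}
// Illustrative sketch: prefer the free global bound from the sstable metadata; only pay
// for the more accurate row-index bound when the index is being read anyway, i.e. when
// partitionLevelDeletion() is invoked because tombstones force us to open the sstable.
public DeletionTime partitionLevelDeletion()
{
    DeletionTime deletion = super.partitionLevelDeletion();  // opens the iterator / reads the index
    if (lowerBound == null)
        lowerBound = lowerBoundFromRowIndex();                // assumed helper, more accurate
    return deletion;
}

public Unfiltered lowerBound()
{
    // Fall back to the bound derived from the min/max clustering values in StatsMetadata.
    return lowerBound != null ? lowerBound : lowerBoundFromMetadata();  // assumed helper
}
{code}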

I hope to resume this in the next few days.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-20 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109052#comment-15109052
 ] 

Branimir Lambov commented on CASSANDRA-8180:


I personally would prefer to not modify the behaviour of {{MergeIterator}} and 
keep it doing one simple thing, but this approach does have its charm.

An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes the same row, flushes again, then checks whether it resurfaces -- I believe this would break with the current code). Use a {{RangeTombstoneBound}} with {{DeletionTime.LIVE}} as the deletion time and a bound obtained by {{RangeTombstone.Bound.inclusiveOpen}}, which should do the right thing in both directions.

{{IMergeIterator.LowerBound}} is cryptic, rename it to 
{{IteratorWithLowerBound}} to be explicit about its purpose.

The choice to set {{rowIndexLowerBound}} in {{partitionLevelDeletion()}} 
appears very arbitrary and fragile. What is the reason to do it separately from 
{{globalLowerBound}}? In fact, why have two separate bounds instead of one, set 
from the most precise information that is available at construction time?
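
For illustration, the renamed interface could look roughly like this (a sketch, not the committed code):

{code}
// Sketch of the suggested rename: an iterator that can additionally report a lower
// bound for its content without being fully initialized, so the merge can defer
// opening it until the bound is actually reached.
public interface IteratorWithLowerBound<In>
{
    // A value that sorts at or before every element this iterator will produce,
    // or null if no such bound is cheaply available.
    In lowerBound();
}
{code}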


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-20 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108363#comment-15108363
 ] 

Branimir Lambov commented on CASSANDRA-8180:


{{updateClusteringValues()}} doesn't produce what you need, as it is not just {{TOP}} and {{BOTTOM}} that are mishandled; any incomplete prefix will also be. What you need is an actual {{min/maxClusteringPrefix}}, which is different from the min/max of each component taken separately.

I don't know if it is possible to append the data you need at the end of the 
stats component and have earlier versions happily ignore that data.
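
A small, made-up example of the difference:

{code}
// Two rows with clusterings (1, 9) and (2, 3).
// Per-component stats:  minClusteringValues = [1, 3], maxClusteringValues = [2, 9]
// Actual prefix bounds: minClusteringPrefix = (1, 9), maxClusteringPrefix = (2, 3)
// Neither [1, 3] nor [2, 9] corresponds to a real clustering, so the component-wise
// values cannot be used directly as a lower bound for the merge.
{code}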




[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-18 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106233#comment-15106233
 ] 

Stefania commented on CASSANDRA-8180:
-

Rebased and squashed [~blambov]:

||trunk||
|[patch|https://github.com/stef1927/cassandra/commits/8180]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-dtest/]|

There is one issue: {{RangeTombstone.Bound.TOP}} and {{BOTTOM}} are not currently updating the min and max clustering values of {{StatsMetadata}}, since their clustering values are empty arrays; see {{MetadataCollector.updateClusteringValues()}} and {{BTW.applyToMarker()}}. As a result, [some range tombstone tests|http://cassci.datastax.com/job/stef1927-8180-testall/1/testReport/junit/junit/] are failing. It's not clear to me how to capture this information in the metadata without changing its format, given that the smallest / biggest values would depend on the individual clustering types, cc [~slebresne] as well.
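
To illustrate the problem described above (a sketch of the behaviour, not the actual implementation):

{code}
// RangeTombstone.Bound.TOP / BOTTOM carry no clustering values (empty array), so a
// per-component update loop such as the one sketched below never sees them, and the
// recorded min/max stay based on rows only -- even though the marker covers the
// whole clustering range.
for (int i = 0; i < clustering.size(); i++)   // size() == 0 for TOP / BOTTOM
{
    minClusteringValues[i] = min(minClusteringValues[i], clustering.get(i));   // sketch
    maxClusteringValues[i] = max(maxClusteringValues[i], clustering.get(i));   // sketch
}
{code}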



[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-13 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096458#comment-15096458
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

[~Stefania] Could you have a shot at rebasing this at some point, and [~blambov] have another look at it once that's done? I'd hate for that code to get so outdated that it's too much effort to rebase.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2016-01-13 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096476#comment-15096476
 ] 

Stefania commented on CASSANDRA-8180:
-

Will rebase and also try to see if we can leverage CASSANDRA-9975 to avoid 
wrapping iterators.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-26 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642236#comment-14642236
 ] 

Stefania commented on CASSANDRA-8180:
-

[~blambov] I've pushed a new commit where I ensure that lower bounds always compare less than real values when the clustering is the same. This means the existing merge iterator algorithm is now almost unchanged; the only difference is that I moved consume() into the candidate, where we make sure the lower bounds are never consumed by the reducer. We still use empty rows as lower bounds, but they are never used outside of the merge iterator candidate. We could use a specialized {{Unfiltered}} if it really bothers you, however. Can you take another look?
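
A rough sketch of that candidate behaviour -- the class shape and field names below are assumptions, not the actual patch:

{code}
// Illustrative sketch: a merge candidate whose current item may be a synthetic lower
// bound; it must sort before real content with the same clustering and must never
// reach the reducer.
int compareTo(Candidate other)
{
    int cmp = comparator.compare(this.item, other.item);
    if (cmp != 0)
        return cmp;
    // On equal clustering, a lower bound sorts strictly before real content.
    return Boolean.compare(!this.isLowerBound, !other.isLowerBound);
}

void consume(Reducer reducer)
{
    if (!isLowerBound)       // lower bounds are placeholders only, never reduced
        reducer.reduce(item);
    item = null;             // in both cases the candidate advances past the current item
}
{code}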

As for performance, with this test the 8180 branch is ahead:

{code}
user profile=https://dl.dropboxusercontent.com/u/15683245/8180.yaml 
ops\(insert=1,\) n=5M -rate threads=300 -insert revisit=uniform\(1..100\) 
visits=fixed\(25\) -pop seq=1..1K read-lookback=uniform\(1..1K\) contents=SORTED
{code}

http://cstar.datastax.com/graph?stats=094f57cc-3409-11e5-bd2b-42010af0688f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=1609.08&ymin=0&ymax=4637.6

The read command is unchanged and performance is similar or maybe still better 
on trunk:

{code}
user profile=https://dl.dropboxusercontent.com/u/15683245/8180.yaml 
ops\(singleval=1,\) n=5M -rate threads=300
{code}

http://cstar.datastax.com/graph?stats=094f57cc-3409-11e5-bd2b-42010af0688f&metric=op_rate&operation=2_user&smoothing=1&show_aggregates=true&xmin=0&xmax=35.53&ymin=0&ymax=188763.3

I think we are still visiting all sstables on the second command because the 
global bounds are probably the same.  Profile is attached as 8180_002.yaml.

I've also noticed with flight recorder that {{BigTableReader.getPosition()}} and {{Tracing.trace()}} are hotspots, both on trunk and on 8180; we should probably optimize them.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-24 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640182#comment-14640182
 ] 

Branimir Lambov commented on CASSANDRA-8180:


Yes, I have started looking at it now.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-24 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640206#comment-14640206
 ] 

Stefania commented on CASSANDRA-8180:
-

Thanks, make sure you have the latest code as I have rebased and fixed a couple 
of things today.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-24 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640223#comment-14640223
 ] 

Benedict commented on CASSANDRA-8180:
-

Support for that isn't fantastic, but instead of providing a select chance of 
0.001, you can use the {{-insert revisit visits contents}}, and {{-pop 
read-lookback}} options.

* {{visits}} tells stress to split inserts for a single partition up into 
multiple visits, performing a roughly equal number of inserts per visit, but 
ultimately inserting the entirety of the partition's contents
* {{revisits}} tells stress _how many_ of these split-up inserts to maintain at 
once, and in what distribution they should be revisited (i.e. exp(1..1M) would 
mean the most recent to be inserted would be exponentially more likely to be 
visited than the oldest; uniform(1..1M) would make them all uniformly likely; 
both would limit the number of in flight inserts to 1M)
* {{read-lookback}} tells reads to operate only over those partitions that have 
been inserted, and only their portion that has been inserted
* {{contents=SORTED}} sorts the partition data before insert. This is expensive 
if there are many items, so you should ensure no single clustering column 
generates more than a couple of dozen values. If necessary, introduce more 
clustering columns.

The main problem is that these features are not well (if at all) tested, so you 
may encounter problems.
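
For example, the invocation used elsewhere on this ticket combines these options roughly as follows (profile URL and sizes taken from that comment; tune them to the test at hand):

{code}
user profile=https://dl.dropboxusercontent.com/u/15683245/8180.yaml ops\(insert=1,\) n=5M -rate threads=300 -insert revisit=uniform\(1..100\) visits=fixed\(25\) -pop seq=1..1K read-lookback=uniform\(1..1K\) contents=SORTED
{code}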


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-24 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640202#comment-14640202
 ] 

Stefania commented on CASSANDRA-8180:
-

We definitely need a place to archive these stress profiles; attaching them to the tickets like I've done so far is not very efficient. CASSANDRA-8503 aims to collect important stress profiles for regression testing, so perhaps we can continue the discussion there?

In terms of this specific performance optimization, so far things aren't going too well. The problem is that I don't know how to create a 'timeseries' profile with cassandra-stress; is this possible, [~benedict]? We basically need many clustering rows per partition, and they must be ordered.

With the profile attached, at best I have been able to show that we are not 
worse:

http://cstar.datastax.com/tests/id/ac8c686c-31de-11e5-95c3-42010af0688f

However, when I increased the data from 1M to 5M and changed the primary key 
population distribution from 1..100K to 1B, I ended up with something worse:

http://cstar.datastax.com/tests/id/11dc0080-31dd-11e5-80f3-42010af0688f

So I think there is still some work to do. I also have some weirdness in the flight recorder profiles, which I have not attached due to their size, but I can if anyone is interested. I fixed two hot-spots today, but there must be at least one more problem. The whole approach of a lazy wrapping iterator is a bit fragile IMO: all it takes is calling a method too soon and the entire optimization is lost, while the overhead of the wrapper iterator and the increased complexity in the merge iterator remain.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-24 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640337#comment-14640337
 ] 

Branimir Lambov commented on CASSANDRA-8180:


I have some worries that relate to the use of a row as the lower bound: for one, there is a small risk that two equal values produced by the iterators could be mishandled by some existing or future code; for another, I do not know what would happen if an sstable's content starts with a range tombstone, which sorts before rows with equal clustering. Additionally, I see no changes in {{MergeIterator.OneToOne}} or {{TrivialOneToOne}}; do they not need updates?

I believe the solution could be simpler and cleaner if the bound is given as a 
distinct special value that sorts before any content (similar to how range 
tombstone bounds sort). The existence of these special values needs only 
concern the arguments given to {{MergeIterator}}: the comparator, which should 
know to sort them before anything else with the same key (including tombstone 
bounds), as well as the reducer, which can then be certain that a bound is not 
going to be attempted to be combined with anything else, and can safely ignore 
it by perhaps returning null at {{getReduced}}; if necessary bounds can be 
easily made unequal to each other as well. {{MergeIterator}} will then pop 
bounds before content rather than with it, and will not need not be modified at 
all. 
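
A rough sketch of this suggestion, using made-up types rather than the real 
{{MergeIterator}} API: the comparator orders a sentinel bound strictly before 
any real item with the same key, and the reduction step simply drops bounds, so 
the merge itself needs no changes.

{code}
import java.util.Comparator;

// Sketch only: made-up types, not the real MergeIterator API. A source's lower
// bound is a sentinel Item that the comparator orders strictly before any real
// item with the same key, and that the reduction step simply drops.
final class BoundSentinelSketch
{
    interface Item<K>
    {
        K key();
        boolean isBound();   // true for the sentinel lower bound, false for real content
    }

    // Bounds sort before real items with an equal key; two bounds compare equal.
    static <K extends Comparable<K>> Comparator<Item<K>> comparator()
    {
        return (a, b) ->
        {
            int cmp = a.key().compareTo(b.key());
            if (cmp != 0)
                return cmp;
            return Boolean.compare(!a.isBound(), !b.isBound());
        };
    }

    // Simplified reduction: a bound never merges with content, so it produces nothing.
    // (A real reducer accumulates all equal inputs before producing an output.)
    static <K> Item<K> getReduced(Item<K> item)
    {
        return item.isBound() ? null : item;
    }
}
{code}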



[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-24 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640475#comment-14640475
 ] 

Stefania commented on CASSANDRA-8180:
-

bq. I have some worries that relate to the use of a row as the lower bound: 
first, there is a small risk that having two equal values produced by an 
iterator could be mishandled by some existing or future code; second, I do not 
know what would happen if an sstable's content starts with a range tombstone, 
which sorts before rows with equal clustering. 

The idea is that these fake Clusterable instances never leave the merge 
iterators. I can create a sub-class of Clusterable for clarity, but it's safe to 
use an empty row as it is only returned as a lower bound and never as an 
iterated value. Have you looked at the version of the code I pushed today, where 
the candidate takes the lower bound from the iterator? At the moment the reducer 
decides whether to ignore it by checking if the row is empty. This is wrong: it 
should be the candidate that decides the bound should be ignored and returns the 
next value instead, but that requires an additional comparison. 

bq. Additionally, I see no changes in MergeIterator.OneToOne or 
TrivialOneToOne, do they not need updates?

No they don't need updating, they never use the lower bound.

bq. I believe the solution could be simpler and cleaner if the bound is given 
as a distinct special value that sorts before any content...

Yes, this would solve the problem mentioned above, where we need to check 
whether there is an equal real value immediately afterwards. I'll work on this 
idea, thank you very much!




[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-22 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14637037#comment-14637037
 ] 

Joshua McKenzie commented on CASSANDRA-8180:


Do we have a place where we're archiving these stress profiles? It seems like it 
would be helpful, for future use, to pair each profile with a short comment on 
what specific setup or portion of the system it stresses. We could also feed 
them into the perf harness when that's ready.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-22 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636514#comment-14636514
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

bq. Sylvain Lebresne are you happy to still be the reviewer or do you want to 
suggest someone else?

I'm leaving on vacation at the end of the week and have enough on my plate 
until then that it's safe to assume I won't have time to look at this one. I'm 
happy to have a look when I'm back and have a tad more time, but it's probably 
a good idea to find another reviewer in the meantime in case that new reviewer 
has time to get to it sooner. I would suggest Branimir if he has some cycles, 
since this ticket has a lot to do with {{MergeIterator}} and he has been dealing 
with it quite a bit lately.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-22 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636538#comment-14636538
 ] 

Stefania commented on CASSANDRA-8180:
-

Thanks. [~blambov] would you have time to review this or shall we wait for 
Sylvain's return?


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-22 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636477#comment-14636477
 ] 

Stefania commented on CASSANDRA-8180:
-

I have attached the latest patch based on trunk, which is ready for review.

The main things to note are:

* In MergeIterator we need to discard fake lower bound values, and we must 
advance the same iterator to check whether there is a real value with exactly 
the same clustering; if so, we must pass it to the reducer together with the 
equal values from the other iterators. However, we cannot discard fake values 
too early: in ManyToOne.advance() we cannot simply keep advancing while we have 
fake values, we need to wait for the heap to be sorted first. I implemented a 
[peek 
method|https://github.com/stef1927/cassandra/commit/9528e672ea65c7a71c0004adfc27e5f4d9ee0acb#diff-a9e2c345aa605d1b8d360b4c44ade32f]
 in the candidate, which is called for fake values when reducing values (a 
simplified sketch of this idea is included at the end of this comment). This 
should not affect the correctness of the algorithm, but there may be a more 
efficient way to do this, cc [~benedict] and [~blambov]. 

* We need to decide if we are happy with a wrapping iterator, 
[LowerBoundUnfilteredRowIterator|https://github.com/stef1927/cassandra/commit/9528e672ea65c7a71c0004adfc27e5f4d9ee0acb#diff-1cdf42ebc69336015e04f287e8450e51],
 in which case we may need a better name. I understand that wrapping too many 
iterators may hurt performance, so we may want to look into modifying 
AbstractSSTableIterator directly, even though this would be a bit more work. I 
also wrap the merged iterator to make sure we update the metrics of iterated 
sstables in the close method, when we know how many sstables were iterated. It 
would be nice to at least do away with this extra wrapping iterator.

* We need a reliable way to signal a fake Unfiltered object (the lower bound). 
I used an [empty 
row|https://github.com/stef1927/cassandra/commit/9528e672ea65c7a71c0004adfc27e5f4d9ee0acb#diff-3e7088b7213c9faaf80e18dadaaa6929],
 but perhaps we should use a new specialization of Unfiltered or something 
else.

[~slebresne] are you happy to still be the reviewer or do you want to suggest 
someone else?
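
For illustration, here is a simplified sketch of the candidate/peek idea 
described in the first point above, with hypothetical names that do not match 
the linked commit: the candidate offers the metadata lower bound first, and 
peek() lets the reducer check whether a real value with the same clustering 
follows before the fake bound is discarded.

{code}
import java.util.Iterator;

// Sketch only (hypothetical names, not the patch itself): a merge-heap candidate
// that first offers the cheap lower bound taken from sstable metadata, and lets
// the reducer peek() at the next real value so a fake bound can be discarded
// without losing a real row that has the same clustering.
final class CandidateSketch<T>
{
    private final Iterator<T> iter;
    private final T lowerBound;   // fake value, never handed to consumers
    private T current;            // value currently sitting in the merge heap
    private T peeked;             // real value fetched early by peek(), if any
    private boolean boundDone;

    CandidateSketch(Iterator<T> iter, T lowerBound)
    {
        this.iter = iter;
        this.lowerBound = lowerBound;
    }

    boolean isLowerBound()
    {
        return !boundDone && current == lowerBound;   // sentinel identity check
    }

    // Advance: offer the lower bound first, then values from the underlying iterator.
    boolean advance()
    {
        if (!boundDone && current == null && lowerBound != null)
        {
            current = lowerBound;
            return true;
        }
        boundDone = true;
        if (peeked != null)
        {
            current = peeked;
            peeked = null;
            return true;
        }
        if (!iter.hasNext())
        {
            current = null;
            return false;
        }
        current = iter.next();
        return true;
    }

    // Look at the next real value without consuming it from this candidate.
    T peek()
    {
        if (peeked == null && iter.hasNext())
            peeked = iter.next();
        return peeked;
    }

    T current()
    {
        return current;
    }
}
{code}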



[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-22 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636584#comment-14636584
 ] 

Stefania commented on CASSANDRA-8180:
-

Noted. I'll work on a suitable stress profile and then attach some flight 
recorder or cstar_perf comparisons.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-07-22 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636566#comment-14636566
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

Side note: it would be nice to have some benchmarks validating the benefits of 
this before we commit, since it's a performance improvement.


[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

2015-04-27 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516104#comment-14516104
 ] 

Stefania commented on CASSANDRA-8180:
-

Rebased to latest 8099 branch. 

Test results will be available here:

http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8180-8099-testall/
http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8180-8099-dtest/
