[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114909#comment-15114909 ]

Stefania commented on CASSANDRA-8180:
-------------------------------------

Pushed a [commit|https://github.com/stef1927/cassandra/commit/d5cfc6fd56d50eda5d9c510591bae1d66e17ec59] where we don't use the lower bound in the presence of tombstones, and {{DeleteTest}} is now passing. CI is still pending, however. I would like to point out that {{iter.partitionLevelDeletion()}} is currently called for all sstables by {{queryMemtableAndDiskInternal()}}, so in the presence of tombstones we have to access the sstable anyway.
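
To make the lower-bound idea easier to discuss, here is a minimal, self-contained model in plain Java (this is not Cassandra code; {{Source}}, {{mergeFirst}} and the integer "clusterings" are made up purely for illustration): each source enters the merge keyed only by a cheap lower bound taken from metadata, and its real iterator is opened only if that bound actually reaches the head of the merge before the limit is satisfied. The point above is that having to check the partition-level deletion forces us to open the sstable regardless.

{code}
import java.util.*;
import java.util.function.Supplier;

public class LazyLowerBoundMergeDemo
{
    // One merge source: a cheap lower bound (think: min clustering value from the
    // sstable metadata) plus a lazily-opened iterator over the real, sorted data.
    static final class Source
    {
        final int lowerBound;
        final Supplier<Iterator<Integer>> open;
        Iterator<Integer> iter;   // null until the merge actually needs this source
        int current;              // current element once the source has been opened

        Source(int lowerBound, Supplier<Iterator<Integer>> open)
        {
            this.lowerBound = lowerBound;
            this.open = open;
        }

        int key() { return iter == null ? lowerBound : current; }
    }

    // Merge the sources in ascending order, emitting at most `limit` elements.
    // A source whose lower bound never reaches the head of the queue before the
    // limit is satisfied is never opened at all.
    static List<Integer> mergeFirst(int limit, Source... sources)
    {
        PriorityQueue<Source> queue = new PriorityQueue<>(Comparator.comparingInt(Source::key));
        queue.addAll(Arrays.asList(sources));
        List<Integer> result = new ArrayList<>();

        while (result.size() < limit && !queue.isEmpty())
        {
            Source s = queue.poll();
            if (s.iter == null)
            {
                // The lower bound reached the head of the merge: only now pay for opening.
                s.iter = s.open.get();
                if (s.iter.hasNext())
                {
                    s.current = s.iter.next();
                    queue.add(s);   // re-queue keyed by the real first element
                }
            }
            else
            {
                result.add(s.current);
                if (s.iter.hasNext())
                {
                    s.current = s.iter.next();
                    queue.add(s);
                }
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        Source recent = new Source(10, () -> Arrays.asList(10, 20, 30).iterator());
        Source old    = new Source(40, () -> { throw new AssertionError("should never be opened"); });
        System.out.println(mergeFirst(3, recent, old));   // [10, 20, 30]; `old` is never opened
    }
}
{code}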

bq. Tombstones. A DELETE WHERE pk = ? AND ck1 = ? in a table with key (pk, ck1, 
ck2) will generate one.

This case already existed in {{DeleteTest.testDeleteWithRangeAndTwoClusteringColumns()}}. It does not fail, though, because the clustering comparator compares prefix values from first to last and therefore works fine with incomplete prefixes.
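
As a toy illustration of that comparator behaviour (plain Java, not Cassandra's {{ClusteringComparator}}; the int arrays stand in for clustering prefixes), comparing component by component from first to last means an incomplete prefix is ordered by its common components alone:

{code}
// Toy model: clustering prefixes compared component by component, first to last.
public class PrefixCompareDemo
{
    static int comparePrefixes(int[] a, int[] b)
    {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++)
        {
            int cmp = Integer.compare(a[i], b[i]);
            if (cmp != 0)
                return cmp;   // decided by the first differing component
        }
        return 0;             // equal on the common components; Cassandra then breaks the tie by prefix kind
    }

    public static void main(String[] args)
    {
        System.out.println(comparePrefixes(new int[]{ 1 },    new int[]{ 1, 5 }));   // 0  (incomplete prefix, still consistent)
        System.out.println(comparePrefixes(new int[]{ 1, 2 }, new int[]{ 1, 5 }));   // -1 (sorts before)
        System.out.println(comparePrefixes(new int[]{ 2 },    new int[]{ 1, 5 }));   // 1  (sorts after)
    }
}
{code}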

bq. An empty row will not work correctly as a lower bound. It does not sort as 
needed with respect to tombstone bounds, which should also be included in the 
test (more specifically, one that adds a row, flushes, deletes same row, 
flushes again, then checks if it resurfaces-- I believe this would break with 
the current code). Use a RangeTombstoneBound with DeletionTime.LIVE as the 
deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which 
should do the right thing in both directions.

I don't think this matters any longer if we never replace a tombstone with a lower bound, but I could not reproduce any failures even when using a lower bound for tombstones (other than for one-sided range tombstones). Here is what I tried:

- add a row, flush, delete the same row, flush again => FINE (we don't even use the merge iterator; a sketch of this scenario as a unit test follows the list)
- add two rows in the same partition, flush, delete one and flush, delete the other one and flush => FINE again
- add three rows in the same partition, flush, delete one and flush, delete the other one and flush => FINE again
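
A minimal sketch of the first scenario as a unit test, assuming the {{CQLTester}} helpers that {{DeleteTest}} already relies on ({{createTable}}, {{execute}}, {{flush}}, {{assertEmpty}}); the table layout and test name are illustrative only:

{code}
@Test
public void testDeletedRowDoesNotResurfaceAcrossSSTables() throws Throwable
{
    createTable("CREATE TABLE %s (pk int, ck int, v text, PRIMARY KEY (pk, ck))");

    execute("INSERT INTO %s (pk, ck, v) VALUES (1, 1, 'one')");
    flush();                                             // the row ends up in one sstable

    execute("DELETE FROM %s WHERE pk = 1 AND ck = 1");
    flush();                                             // the tombstone ends up in another

    // The deleted row must not resurface when the two sstables are merged at read time.
    assertEmpty(execute("SELECT * FROM %s WHERE pk = 1"));
}
{code}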

We can replace {{BTreeRow.emptyRow(ret)}} with {{new RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(), ret.getRawValues()), DeletionTime.LIVE)}} if there is still a valid reason to do so; ideally, a failing test would be useful. I will clean up {{lowerBound()}} and add some comments to it in the next round, once we have reached a decision.
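
For readability, here is the same suggested replacement laid out as a snippet; it only restates the expression above ({{filter}} and {{ret}} as already used by {{lowerBound()}}), and the surrounding {{return}} is assumed:

{code}
// Instead of returning BTreeRow.emptyRow(ret) as the artificial lower bound,
// return a live range tombstone bound marker built from the same clustering values.
return new RangeTombstoneBoundMarker(
    RangeTombstone.Bound.inclusiveOpen(filter.isReversed(), ret.getRawValues()),
    DeletionTime.LIVE);
{code}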

> Optimize disk seek using min/max column name meta data when the LIMIT clause is used
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8180
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths
>         Environment: Cassandra 2.0.10
>            Reporter: DOAN DuyHai
>            Assignee: Stefania
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: 8180_001.yaml, 8180_002.yaml
>
>
> I was working on an example of a sensor data table (time series) and faced a use case where C* does not optimize reads on disk.
> {code}
> cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) WITH CLUSTERING ORDER BY (col DESC);
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
> ...
> >nodetool flush test test
> {code}
> After that, I activate request tracing:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
>  activity                                                                  | timestamp    | source    | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
>                                                         execute_cql3_query | 23:48:46,498 | 127.0.0.1 |              0
>                             Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 23:48:46,498 | 127.0.0.1 |             74
>                                                        Preparing statement | 23:48:46,499 | 127.0.0.1 |            253
>                                   Executing single-partition query on test | 23:48:46,499 | 127.0.0.1 |            930
>                                               Acquiring sstable references | 23:48:46,499 | 127.0.0.1 |            943
>                                                Merging memtable tombstones | 23:48:46,499 | 127.0.0.1 |           1032
>                                                Key cache hit for sstable 3 | 23:48:46,500 | 127.0.0.1 |           1160
>                                Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 |           1173
>                                                Key cache hit for sstable 2 | 23:48:46,500 | 127.0.0.1 |           1889
>                                Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 |           1901
>                                                Key cache hit for sstable 1 | 23:48:46,501 | 127.0.0.1 |           2373
>                                Seeking to partition beginning in data file | 23:48:46,501 | 127.0.0.1 |           2384
>  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:48:46,501 | 127.0.0.1 |           2768
>                                 Merging data from memtables and 3 sstables | 23:48:46,501 | 127.0.0.1 |           2784
>                                         Read 2 live and 0 tombstoned cells | 23:48:46,501 | 127.0.0.1 |           2976
>                                                           Request complete | 23:48:46,501 | 127.0.0.1 |           3551
> {code}
> We can clearly see that C* hits 3 SSTables on disk instead of just one, although it has the min/max column metadata to decide which SSTable contains the most recent data.
> Funnily enough, if we add a clause on the clustering column to the select, this time C* optimizes the read path:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
>  activity                                                                  | timestamp    | source    | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
>                                                         execute_cql3_query | 23:52:31,888 | 127.0.0.1 |              0
>                Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | 23:52:31,888 | 127.0.0.1 |             60
>                                                        Preparing statement | 23:52:31,888 | 127.0.0.1 |            277
>                                   Executing single-partition query on test | 23:52:31,889 | 127.0.0.1 |            961
>                                               Acquiring sstable references | 23:52:31,889 | 127.0.0.1 |            971
>                                                Merging memtable tombstones | 23:52:31,889 | 127.0.0.1 |           1020
>                                                Key cache hit for sstable 3 | 23:52:31,889 | 127.0.0.1 |           1108
>                                Seeking to partition beginning in data file | 23:52:31,889 | 127.0.0.1 |           1117
>  Skipped 2/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:52:31,889 | 127.0.0.1 |           1611
>                                 Merging data from memtables and 1 sstables | 23:52:31,890 | 127.0.0.1 |           1624
>                                         Read 1 live and 0 tombstoned cells | 23:52:31,890 | 127.0.0.1 |           1700
>                                                           Request complete | 23:52:31,890 | 127.0.0.1 |           2140
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
