[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109971#comment-15109971 ]

Stefania commented on CASSANDRA-8180:
-------------------------------------

Thank you for the review.

bq. updateClusteringValues() doesn't produce what you need, as it is not just 
TOP and BOTTOM that are mishandled, any incomplete prefix will also be. What 
you need is an actual min/maxClusteringPrefix which is different from the 
min/max of each component separately.

I agree that we need a full clustering prefix. What I don't understand is how 
things like {{shouldInclude()}} in {{ClusteringIndexNamesFilter}} or 
{{ClusteringIndexSliceFilter}} work in that case. What would be an example of 
another incomplete prefix, and do we have a gap in the tests then?

bq. I don't know if it is possible to append the data you need at the end of 
the stats component and have earlier versions happily ignore that data.

There are 4 components that we write in the same file: VALIDATION, STATS, 
COMPACTION and HEADER, so we can't simply keep reading until the end of the 
file. However, we do write a TOC with the position of each component, so it 
should be possible, but it would require changes to 
{{MetadataSerializer.deserialize()}} and to the signature of 
{{IMetadataComponentSerializer.deserialize()}}, which should receive the total 
size in order to work out whether there is more data to read at the end. I 
guess we can go for it.
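
To make that concrete, here is a minimal sketch of how the component deserializer could detect trailing data. The position tracking ({{TrackedDataInputPlus.getBytesRead()}}) and the helpers ({{deserializeKnownFields}}, {{withClusteringBounds}}, {{readMinMaxClusteringPrefixes}}) are assumptions for illustration, not the actual code:

{code}
// Hypothetical sketch: the TOC gives us the component's total serialized
// size, so the deserializer can read the known fields first and treat any
// remaining bytes as the new, optional data. Old sstables end exactly after
// the known fields, so nothing extra is read from them, and old readers
// never look past the fields they know about.
StatsMetadata deserialize(Version version, TrackedDataInputPlus in, int totalSize) throws IOException
{
    StatsMetadata stats = deserializeKnownFields(version, in);  // existing logic, unchanged
    if (in.getBytesRead() < totalSize)                          // trailing bytes => new format
        stats = stats.withClusteringBounds(readMinMaxClusteringPrefixes(version, in));
    return stats;
}
{code}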

bq. I personally would prefer to not modify the behaviour of MergeIterator and 
keep it doing one simple thing, but this approach does have its charm.

The changes to MergeIterator are mostly in the candidate and really minimal; 
the actual algorithm is unchanged. However, if you have a less invasive 
approach in mind, I'm eager to hear it.
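
For reference, the shape of the change at the candidate level is roughly as follows (a simplified sketch with hypothetical names, not the patch itself):

{code}
// Sketch: the candidate first offers its source's cheap lower bound to the
// merge heap, and only touches the real iterator once that placeholder has
// been consumed, i.e. once this source could actually win the merge.
import java.util.Iterator;

final class Candidate<In>
{
    private final Iterator<In> source;  // may be expensive to initialize
    private final In lowerBound;        // free placeholder, e.g. from sstable metadata
    private In item;
    private boolean boundConsumed;

    Candidate(Iterator<In> source, In lowerBound)
    {
        this.source = source;
        this.lowerBound = lowerBound;
        this.boundConsumed = (lowerBound == null);
    }

    /** Advance to the next item: the placeholder bound first, then the real data. */
    boolean advance()
    {
        if (!boundConsumed)
        {
            item = lowerBound;          // sorts correctly but carries no data
            boundConsumed = true;
            return true;
        }
        if (!source.hasNext())
            return false;
        item = source.next();
        return true;
    }

    In item()
    {
        return item;
    }
}
{code}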

bq. An empty row will not work correctly as a lower bound. It does not sort as 
needed with respect to tombstone bounds, which should also be included in the 
test (more specifically, one that adds a row, flushes, deletes same row, 
flushes again, then checks if it resurfaces; I believe this would break with 
the current code).

Thanks, I'll add this test.
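
The scenario would look roughly like this in CQLTester style (a sketch, assuming the usual {{createTable}}/{{execute}}/{{flush}} helpers):

{code}
// Write a row, flush, delete the same row, flush again, then verify the
// deleted row does not resurface when sstables are skipped by lower bound.
createTable("CREATE TABLE %s (id int, col int, val text, PRIMARY KEY (id, col))");
execute("INSERT INTO %s (id, col, val) VALUES (1, 10, '10')");
flush();
execute("DELETE FROM %s WHERE id = 1 AND col = 10");
flush();
assertEmpty(execute("SELECT * FROM %s WHERE id = 1 LIMIT 1"));
{code}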

bq. Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a 
bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right 
thing in both directions.

I'm not sure what you mean: is this for the test or for the fix? I'm sure I'll 
work it out when I write the test though.
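
If I read the suggestion correctly, it amounts to something like this sketch, where {{isReversed}} and {{lowerBoundValues}} stand for the filter's direction and the sstable's min clustering values (my names, not verified against the code):

{code}
// An artificial bound marker with no actual deletion (DeletionTime.LIVE)
// sorts correctly against real tombstone bounds in both forward and
// reversed clustering order, unlike an empty row.
RangeTombstone.Bound bound = RangeTombstone.Bound.inclusiveOpen(isReversed, lowerBoundValues);
Unfiltered lowerBound = new RangeTombstoneBoundMarker(bound, DeletionTime.LIVE);
{code}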

bq. IMergeIterator.LowerBound is cryptic, rename it to IteratorWithLowerBound 
to be explicit about its purpose.

OK

bq. The choice to set rowIndexLowerBound in partitionLevelDeletion() appears 
very arbitrary and fragile. What is the reason to do it separately from 
globalLowerBound? In fact, why have two separate bounds instead of one, set 
from the most precise information that is available at construction time?

The global lower bound is free, since it is available in the metadata. The 
index lower bound is more accurate, but it requires seeking in the index file. 
Calling {{super.partitionLevelDeletion()}} also involves initializing the 
iterator and accessing the data file (the AbstractSSTableIterator 
constructor). So we decided to use the more accurate bound only when we really 
have to access the index anyway, that is when {{partitionLevelDeletion()}} is 
called and there are tombstones. See [this comment|https://issues.apache.org/jira/browse/CASSANDRA-8180?focusedCommentId=14388301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14388301] above.
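
Schematically, the trade-off looks something like this (hypothetical helper names, not the patch's exact shape):

{code}
// The metadata bound costs no I/O at construction time; the sharper
// row-index bound is fetched lazily, only when partitionLevelDeletion()
// forces an index seek anyway because there are tombstones.
Unfiltered lowerBound = boundFromStatsMetadata(sstable);     // free, from stats

@Override
public DeletionTime partitionLevelDeletion()
{
    DeletionTime deletion = super.partitionLevelDeletion();  // seeks index/data file
    if (!deletion.isLive())                                  // tombstones present
        lowerBound = boundFromRowIndex(sstable);             // index already read, upgrade
    return deletion;
}
{code}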

I hope to resume this in the next few days.

> Optimize disk seek using min/max column name meta data when the LIMIT clause is used
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8180
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths
>         Environment: Cassandra 2.0.10
>            Reporter: DOAN DuyHai
>            Assignee: Stefania
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: 8180_001.yaml, 8180_002.yaml
>
>
> I was working on an example of a sensor data table (time series) and faced a 
> use case where C* does not optimize reads on disk.
> {code}
> cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
> WITH CLUSTERING ORDER BY (col DESC);
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
> ...
> >nodetool flush test test
> {code}
> After that, I activated request tracing:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
>  activity                                                                  | timestamp    | source    | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
>                                                         execute_cql3_query | 23:48:46,498 | 127.0.0.1 |              0
>                             Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 23:48:46,498 | 127.0.0.1 |             74
>                                                        Preparing statement | 23:48:46,499 | 127.0.0.1 |            253
>                                   Executing single-partition query on test | 23:48:46,499 | 127.0.0.1 |            930
>                                               Acquiring sstable references | 23:48:46,499 | 127.0.0.1 |            943
>                                                Merging memtable tombstones | 23:48:46,499 | 127.0.0.1 |           1032
>                                                Key cache hit for sstable 3 | 23:48:46,500 | 127.0.0.1 |           1160
>                                Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 |           1173
>                                                Key cache hit for sstable 2 | 23:48:46,500 | 127.0.0.1 |           1889
>                                Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 |           1901
>                                                Key cache hit for sstable 1 | 23:48:46,501 | 127.0.0.1 |           2373
>                                Seeking to partition beginning in data file | 23:48:46,501 | 127.0.0.1 |           2384
>  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:48:46,501 | 127.0.0.1 |           2768
>                                 Merging data from memtables and 3 sstables | 23:48:46,501 | 127.0.0.1 |           2784
>                                         Read 2 live and 0 tombstoned cells | 23:48:46,501 | 127.0.0.1 |           2976
>                                                           Request complete | 23:48:46,501 | 127.0.0.1 |           3551
> {code}
> We can clearly see that C* hits 3 SSTables on disk instead of just one, 
> although it has the min/max column metadata to decide which SSTable contains 
> the most recent data.
> Funnily enough, if we add a clause on the clustering column to the select, 
> this time C* optimizes the read path:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
>  activity                                                                  | timestamp    | source    | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
>                                                         execute_cql3_query | 23:52:31,888 | 127.0.0.1 |              0
>                Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | 23:52:31,888 | 127.0.0.1 |             60
>                                                        Preparing statement | 23:52:31,888 | 127.0.0.1 |            277
>                                   Executing single-partition query on test | 23:52:31,889 | 127.0.0.1 |            961
>                                               Acquiring sstable references | 23:52:31,889 | 127.0.0.1 |            971
>                                                Merging memtable tombstones | 23:52:31,889 | 127.0.0.1 |           1020
>                                                Key cache hit for sstable 3 | 23:52:31,889 | 127.0.0.1 |           1108
>                                Seeking to partition beginning in data file | 23:52:31,889 | 127.0.0.1 |           1117
>  Skipped 2/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:52:31,889 | 127.0.0.1 |           1611
>                                 Merging data from memtables and 1 sstables | 23:52:31,890 | 127.0.0.1 |           1624
>                                         Read 1 live and 0 tombstoned cells | 23:52:31,890 | 127.0.0.1 |           1700
>                                                           Request complete | 23:52:31,890 | 127.0.0.1 |           2140
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
