[jira] [Commented] (CASSANDRA-6587) Slow query when using token range and secondary index

Otto Chrons (JIRA) Thu, 13 Mar 2014 06:20:21 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13933223#comment-13933223
 ]


Otto Chrons commented on CASSANDRA-6587:
----------------------------------------

Might be also related to vnodes (3 node cluster, vnodes=256). The same problem 
does not surface on a single node version.

While the index is being created, a query to fetch data based on the secondary 
index is fast, but after the index is complete it gets really slow. Like 5 
minutes to query 8700 rows from a set of 107k rows. Reading all the 107k rows 
(not using secondary index) takes only a dozen seconds, so something is 
seriously wrong here. And this was without {{LIMIT}} or any token-functions.

The dataset is only about 9MB and the index is about 4.7MB, so this is not an 
I/O issue in anyway.



> Slow query when using token range and secondary index
> -----------------------------------------------------
>
>                 Key: CASSANDRA-6587
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6587
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jan Chochol
>
> We are using token ranges to simulate pagination on external API. To achieve 
> this, we use similar queries:
> {noformat}
> SELECT * FROM table WHERE TOKEN(partition_key) > TOKEN('offset') AND 
> secondary_key = 'value' LIMIT 1000;
> {noformat}
> We found that such statement is quite ineffective, and we do not know how to 
> solve it.
> Let's try some example.
> You can fill Cassandra with folowing script:
> {noformat}
> perl -e "print(\"DROP KEYSPACE t;\nCREATE KEYSPACE t WITH replication = 
> {'class': 'SimpleStrategy', 'replication_factor' : 1};\nuse t;\nCREATE TABLE 
> t (a varchar PRIMARY KEY, b varchar, c varchar, d varchar);\nCREATE INDEX t_b 
> ON t (b);\nCREATE INDEX t_c ON t (c);\nCREATE INDEX t_d ON t (d);\n\");\$max 
> = 100000; for(\$i = 0; \$i < \$max; \$i++) { \$j = int(\$i * 10 / \$max); \$k 
> = int(\$i * 100 / \$max); print(\"INSERT INTO t (a, b, c, d) VALUES ('a\$i', 
> 'b\$j', 'c\$k', 'd\$i');\n\")}; for(\$i = 0; \$i < \$max; \$i++) { 
> print(\"INSERT INTO t (a, b, c, d) VALUES ('e\$i', 'f\$j', 'g\$k', 
> 'h\$i');\n\")}" | cqlsh
> {noformat}
> First we looked for last but one parition key:
> {noformat}
> [root@jch3-devel:~/c4] echo "SELECT a FROM t.t WHERE b = 'b1' LIMIT 100000;" 
> | cqlsh | tail
>  a18283
>  a11336
>  a14712
>  a11476
>  a19396
>  a14269
>  a10719
>  a14521
>  a13934
> {noformat}
> Than we issue following commands for some interesting behaviour:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 
> 10000000;
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10;
> SELECT a, d FROM t.t WHERE b = 'b1' AND a = 'a14521' LIMIT 10;
> {noformat}
> And here is result:
> {noformat}
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND 
> TOKEN(a) = TOKEN('a14521') LIMIT 10000000;" | cqlsh
>  a      | d
> --------+--------
>  a14521 | d14521
> real    0m0.647s
> user    0m0.307s
> sys     0m0.076s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND 
> TOKEN(a) = TOKEN('a14521') LIMIT 10;" | cqlsh
>  a      | d
> --------+--------
>  a14521 | d14521
> real    0m16.454s
> user    0m0.341s
> sys     0m0.090s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a = 
> 'a14521' LIMIT 10;" | cqlsh
>  a      | d
> --------+--------
>  a14521 | d14521
> real    0m0.404s
> user    0m0.309s
> sys     0m0.071s
> {noformat}
> Problem with {{LIMIT}} is described in CASSANDRA-6348, and is quite funny - 
> lower the limit, slower the requst (and with different structure of data it 
> can be even worse).
> This query is quite silly in reality (asking with secondary key, when you 
> have primary key), but is close as possible to our use case:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 10;
> {noformat}
> But we simply can not do:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 10;
> {noformat}
> As this is unsupported.
> {{CompositesSearcher.java}} gives us some clue about the problem:
> {noformat}
>         /*
>          * XXX: If the range requested is a token range, we'll have to start 
> at the beginning (and stop at the end) of
>          * the indexed row unfortunately (which will be inefficient), because 
> we have not way to intuit the small
>          * possible key having a given token. A fix would be to actually 
> store the token along the key in the
>          * indexed row.
>          */
> {noformat}
> Index row contains parition keys in partion key ordering (ordering exposed in 
> CQL3 as {{TOKEN(partition_key)}}), so these two request are expected to 
> return same values:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 1;
> SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 1;
> {noformat}
> But the second is not supported.
> Currently we are considering to go to our production with this patch:
> {noformat}
> diff --git 
> a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java 
> b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> index 44a1e64..0228c3a 100644
> --- a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> +++ b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> @@ -1123,8 +1123,10 @@ public class SelectStatement implements CQLStatement
>                          stmt.keyIsInRelation = true;
>                      }
>                  }
> -                else
> +                else if (stmt.metadataRestrictions.isEmpty())
>                  {
> +                    // This is poor man heuristic, and probably far from 
> correct, but it allows us these requests:
> +                    // SELECT * FROM table WHERE partition_key > 'offset' 
> AND secondary_key = 'value'
>                      throw new InvalidRequestException("Only EQ and IN 
> relation are supported on the partition key (you will need to use the token() 
> function for non equality based relation)");
>                  }
>                  previous = cname;
> diff --git 
> a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java 
> b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> index 5ab1df6..2ba2845 100644
> --- 
> a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> +++ 
> b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> @@ -190,7 +190,8 @@ public class CompositesSearcher extends 
> SecondaryIndexSearcher
>  
>              private int meanColumns = 
> Math.max(index.getIndexCfs().getMeanColumns(), 1);
>              // We shouldn't fetch only 1 row as this provides buggy paging 
> in case the first row doesn't satisfy all clauses
> -            private final int rowsPerQuery = 
> Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);
> +            // We  are not too interested in memory consumption, as we are 
> using only "small data"
> +            private final int rowsPerQuery = 
> Math.max(Math.min(filter.maxRows(), filter.maxColumns()), 2);
>  
>              public boolean needsFiltering()
>              {
> {noformat}
> It is probably far from correct, but at least it gives us acceptable 
> performance:
> {noformat}
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND 
> TOKEN(a) > TOKEN('a14521') LIMIT 1;" | cqlsh
>  a      | d
> --------+--------
>  a13934 | d13934
> real    0m15.359s
> user    0m0.313s
> sys     0m0.090s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a > 
> 'a14521' LIMIT 1;" | cqlsh
>  a      | d
> --------+--------
>  a13934 | d13934
> real    0m0.449s
> user    0m0.323s
> sys     0m0.106s
> {noformat}
> We do not think that {{TOKEN}} range requests with secondary indexes should 
> be so much slower.
> Can you please exentend CQL3 with possibility to parition keys range requests 
> with secondary indexes (than change this issue to feature request), or make 
> {{TOKEN}} range requests with secondary indexes faster?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (CASSANDRA-6587) Slow query when using token range and secondary index

Reply via email to