[
https://issues.apache.org/jira/browse/CASSANDRA-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932952#comment-13932952
]
Otto Chrons commented on CASSANDRA-6587:
----------------------------------------
We are experiencing the same problem with Cassandra 2.0.6
Funnily enough, recreating (DROP / CREATE) the index helped with the
performance where rebuilding the index nor repairing the cf using nodetool
didn't make any difference.
> Slow query when using token range and secondary index
> -----------------------------------------------------
>
> Key: CASSANDRA-6587
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6587
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Jan Chochol
>
> We are using token ranges to simulate pagination on external API. To achieve
> this, we use similar queries:
> {noformat}
> SELECT * FROM table WHERE TOKEN(partition_key) > TOKEN('offset') AND
> secondary_key = 'value' LIMIT 1000;
> {noformat}
> We found that such statement is quite ineffective, and we do not know how to
> solve it.
> Let's try some example.
> You can fill Cassandra with folowing script:
> {noformat}
> perl -e "print(\"DROP KEYSPACE t;\nCREATE KEYSPACE t WITH replication =
> {'class': 'SimpleStrategy', 'replication_factor' : 1};\nuse t;\nCREATE TABLE
> t (a varchar PRIMARY KEY, b varchar, c varchar, d varchar);\nCREATE INDEX t_b
> ON t (b);\nCREATE INDEX t_c ON t (c);\nCREATE INDEX t_d ON t (d);\n\");\$max
> = 100000; for(\$i = 0; \$i < \$max; \$i++) { \$j = int(\$i * 10 / \$max); \$k
> = int(\$i * 100 / \$max); print(\"INSERT INTO t (a, b, c, d) VALUES ('a\$i',
> 'b\$j', 'c\$k', 'd\$i');\n\")}; for(\$i = 0; \$i < \$max; \$i++) {
> print(\"INSERT INTO t (a, b, c, d) VALUES ('e\$i', 'f\$j', 'g\$k',
> 'h\$i');\n\")}" | cqlsh
> {noformat}
> First we looked for last but one parition key:
> {noformat}
> [root@jch3-devel:~/c4] echo "SELECT a FROM t.t WHERE b = 'b1' LIMIT 100000;"
> | cqlsh | tail
> a18283
> a11336
> a14712
> a11476
> a19396
> a14269
> a10719
> a14521
> a13934
> {noformat}
> Than we issue following commands for some interesting behaviour:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT
> 10000000;
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) = TOKEN('a14521') LIMIT 10;
> SELECT a, d FROM t.t WHERE b = 'b1' AND a = 'a14521' LIMIT 10;
> {noformat}
> And here is result:
> {noformat}
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND
> TOKEN(a) = TOKEN('a14521') LIMIT 10000000;" | cqlsh
> a | d
> --------+--------
> a14521 | d14521
> real 0m0.647s
> user 0m0.307s
> sys 0m0.076s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND
> TOKEN(a) = TOKEN('a14521') LIMIT 10;" | cqlsh
> a | d
> --------+--------
> a14521 | d14521
> real 0m16.454s
> user 0m0.341s
> sys 0m0.090s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a =
> 'a14521' LIMIT 10;" | cqlsh
> a | d
> --------+--------
> a14521 | d14521
> real 0m0.404s
> user 0m0.309s
> sys 0m0.071s
> {noformat}
> Problem with {{LIMIT}} is described in CASSANDRA-6348, and is quite funny -
> lower the limit, slower the requst (and with different structure of data it
> can be even worse).
> This query is quite silly in reality (asking with secondary key, when you
> have primary key), but is close as possible to our use case:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 10;
> {noformat}
> But we simply can not do:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 10;
> {noformat}
> As this is unsupported.
> {{CompositesSearcher.java}} gives us some clue about the problem:
> {noformat}
> /*
> * XXX: If the range requested is a token range, we'll have to start
> at the beginning (and stop at the end) of
> * the indexed row unfortunately (which will be inefficient), because
> we have not way to intuit the small
> * possible key having a given token. A fix would be to actually
> store the token along the key in the
> * indexed row.
> */
> {noformat}
> Index row contains parition keys in partion key ordering (ordering exposed in
> CQL3 as {{TOKEN(partition_key)}}), so these two request are expected to
> return same values:
> {noformat}
> SELECT a, d FROM t.t WHERE b = 'b1' AND TOKEN(a) > TOKEN('a14521') LIMIT 1;
> SELECT a, d FROM t.t WHERE b = 'b1' AND a > 'a14521' LIMIT 1;
> {noformat}
> But the second is not supported.
> Currently we are considering to go to our production with this patch:
> {noformat}
> diff --git
> a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> index 44a1e64..0228c3a 100644
> --- a/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> +++ b/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
> @@ -1123,8 +1123,10 @@ public class SelectStatement implements CQLStatement
> stmt.keyIsInRelation = true;
> }
> }
> - else
> + else if (stmt.metadataRestrictions.isEmpty())
> {
> + // This is poor man heuristic, and probably far from
> correct, but it allows us these requests:
> + // SELECT * FROM table WHERE partition_key > 'offset'
> AND secondary_key = 'value'
> throw new InvalidRequestException("Only EQ and IN
> relation are supported on the partition key (you will need to use the token()
> function for non equality based relation)");
> }
> previous = cname;
> diff --git
> a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> index 5ab1df6..2ba2845 100644
> ---
> a/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> +++
> b/src/java/org/apache/cassandra/db/index/composites/CompositesSearcher.java
> @@ -190,7 +190,8 @@ public class CompositesSearcher extends
> SecondaryIndexSearcher
>
> private int meanColumns =
> Math.max(index.getIndexCfs().getMeanColumns(), 1);
> // We shouldn't fetch only 1 row as this provides buggy paging
> in case the first row doesn't satisfy all clauses
> - private final int rowsPerQuery =
> Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);
> + // We are not too interested in memory consumption, as we are
> using only "small data"
> + private final int rowsPerQuery =
> Math.max(Math.min(filter.maxRows(), filter.maxColumns()), 2);
>
> public boolean needsFiltering()
> {
> {noformat}
> It is probably far from correct, but at least it gives us acceptable
> performance:
> {noformat}
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND
> TOKEN(a) > TOKEN('a14521') LIMIT 1;" | cqlsh
> a | d
> --------+--------
> a13934 | d13934
> real 0m15.359s
> user 0m0.313s
> sys 0m0.090s
> [root@jch3-devel:~/c4] time echo "SELECT a, d FROM t.t WHERE b = 'b1' AND a >
> 'a14521' LIMIT 1;" | cqlsh
> a | d
> --------+--------
> a13934 | d13934
> real 0m0.449s
> user 0m0.323s
> sys 0m0.106s
> {noformat}
> We do not think that {{TOKEN}} range requests with secondary indexes should
> be so much slower.
> Can you please exentend CQL3 with possibility to parition keys range requests
> with secondary indexes (than change this issue to feature request), or make
> {{TOKEN}} range requests with secondary indexes faster?
--
This message was sent by Atlassian JIRA
(v6.2#6252)