[
https://issues.apache.org/jira/browse/CASSANDRA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey Yeschenko updated CASSANDRA-10050:
------------------------------------------
Issue Type: Improvement (was: Bug)
> Secondary Index Performance Dependent on TokenRange Searched in Analytics
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-10050
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10050
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Single node, macbook, 2.1.8
> Reporter: Russell Alexander Spitzer
> Fix For: 3.x
>
>
> In doing some test work on the Spark Cassandra Connector I saw some odd
> performance when pushing down range queries with Secondary Index filters.
> When running the queries we see a huge amount of time during which the C*
> server is not doing any work and the queries seem to be hanging. This
> investigation led to the work in this document
> https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0
> The Spark Cassandra Connector builds up token range specific queries and
> allows the user to push down relevant fields to C*. Here we have two indexed
> fields (size) and (color) being pushed down to C*.
> {code}
> SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <=
> $max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code}
> These queries will have different token ranges inserted and will be executed
> as separate Spark tasks. Spark tasks with token ranges near the Min(token)
> end up executing much faster than those near Max(token), which also happen
> to throw errors.
> {code}
> Coordinator node timed out waiting for replica nodes' responses]
> message="Operation timed out - received only 0 responses."
> info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
> {code}
> I took the queries and ran them through CQLSH to see the difference in time.
> Query time increases linearly with the position of the token range being
> queried, starting at only 2 seconds for queries near the beginning of the
> full token spectrum and rising to over 12 seconds at the end of the spectrum.
> The question is, can this behavior be improved? Or should we not recommend
> using secondary indexes with Analytics workloads?
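
For readers who want to reproduce the per-range queries described in the report, the following is a minimal sketch of how the token spectrum might be split and the $min/$max placeholders filled in. It assumes the Murmur3Partitioner token bounds (-2^63 to 2^63 - 1) and an even split. The names split_token_ranges, build_queries and QUERY_TEMPLATE are hypothetical and are not taken from the Spark Cassandra Connector, which derives its splits differently.

```python
# Assumption: Murmur3Partitioner, whose tokens span the signed 64-bit range.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

# The query from the report, with $min/$max turned into format fields.
QUERY_TEMPLATE = (
    'SELECT count(*) FROM ks.tab '
    'WHERE token("store") > {lo} AND token("store") <= {hi} '
    "AND color = 'red' AND size = 'P' ALLOW FILTERING;"
)


def split_token_ranges(num_splits, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Return num_splits contiguous (lo, hi] token ranges covering lo..hi."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    width = hi - lo
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [lo + (width * i) // num_splits for i in range(num_splits)]
    bounds.append(hi)
    return list(zip(bounds[:-1], bounds[1:]))


def build_queries(num_splits):
    """Build one CQL string per token range, in ascending token order."""
    return [
        QUERY_TEMPLATE.format(lo=lo, hi=hi)
        for lo, hi in split_token_ranges(num_splits)
    ]


if __name__ == "__main__":
    for query in build_queries(4):
        print(query)
```

Running each generated query in CQLSH with tracing or a timer, in order, is one way to check whether latency grows with the position of the range as reported.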