[jira] [Created] (CASSANDRA-7805) Performance regression in multi-get (in clause) due to automatic paging

J.B. Langston (JIRA) Wed, 20 Aug 2014 12:54:37 -0700

J.B. Langston created CASSANDRA-7805:
----------------------------------------


             Summary: Performance regression in multi-get (in clause) due to 
automatic paging
                 Key: CASSANDRA-7805
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7805
             Project: Cassandra
          Issue Type: Bug
            Reporter: J.B. Langston
            Priority: Minor


Comparative benchmarking of 1.2 vs. 2.0 shows a regression in multi-get (in 
clause) queries due to automatic paging.  Take the following example:

select myId, col1, col2, col3 from myTable where col1 = 'xyz' and myId IN (id1, 
id1, ..., id100); // primary key is (myId, col1)

We were suprised to see that in 2.0, the above query was giving an order of 
magnitude worse performance than in 1.2. Digging in, I believe it is due to the 
issue described in the comment at the top of MultiPartitionPager.java (v2.0.9): 
"Note that this is not easy to make efficient. Indeed, we need to page the 
first command fully before returning results from the next one, but if the 
result returned by each command is small (compared to pageSize), paging the 
commands one at a time under-performs compared to parallelizing."

The perf regression is due to the new paging feature in 2.0. The server is 
executing the read for each id in the IN clause *sequentially* in order to 
implement the paging semantics.

The wisdom of using multi-get like this has been debated in other forums, but 
the thing that's unfortunate from a user point of view, is if they had a bunch 
of code working against 1.2 and then they upgrade their cluster to 2.0 and all 
of a sudden start to see an order of magnitude or worse perf regression. That 
will be perceived as a problem. I think it would surprise anyone not familiar 
with the code that the separate reads for the IN clause would be done 
sequentially rather than in parallel.

As a workaround, disable paging in the Java driver by setting fetchSize to 
Integer.MAX_VALUE on your QueryOptions



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (CASSANDRA-7805) Performance regression in multi-get (in clause) due to automatic paging

Reply via email to