J.B. Langston created CASSANDRA-7805:
----------------------------------------
Summary: Performance regression in multi-get (in clause) due to
automatic paging
Key: CASSANDRA-7805
URL: https://issues.apache.org/jira/browse/CASSANDRA-7805
Project: Cassandra
Issue Type: Bug
Reporter: J.B. Langston
Priority: Minor
Comparative benchmarking of 1.2 vs. 2.0 shows a regression in multi-get (in
clause) queries due to automatic paging. Take the following example:
select myId, col1, col2, col3 from myTable where col1 = 'xyz' and myId IN (id1,
id1, ..., id100); // primary key is (myId, col1)
We were suprised to see that in 2.0, the above query was giving an order of
magnitude worse performance than in 1.2. Digging in, I believe it is due to the
issue described in the comment at the top of MultiPartitionPager.java (v2.0.9):
"Note that this is not easy to make efficient. Indeed, we need to page the
first command fully before returning results from the next one, but if the
result returned by each command is small (compared to pageSize), paging the
commands one at a time under-performs compared to parallelizing."
The perf regression is due to the new paging feature in 2.0. The server is
executing the read for each id in the IN clause *sequentially* in order to
implement the paging semantics.
The wisdom of using multi-get like this has been debated in other forums, but
the thing that's unfortunate from a user point of view, is if they had a bunch
of code working against 1.2 and then they upgrade their cluster to 2.0 and all
of a sudden start to see an order of magnitude or worse perf regression. That
will be perceived as a problem. I think it would surprise anyone not familiar
with the code that the separate reads for the IN clause would be done
sequentially rather than in parallel.
As a workaround, disable paging in the Java driver by setting fetchSize to
Integer.MAX_VALUE on your QueryOptions
--
This message was sent by Atlassian JIRA
(v6.2#6252)