This is a follow-on repost to clarify what we are trying to do, providing
information that was missing or unclear the first time.



Goal:  Verify horizontal scaling for random, non-repeating key reads using
the simplest (minimal) configuration possible.



Background:

A couple of years ago we did similar performance testing with Cassandra for
both read and write performance and found excellent (essentially linear)
horizontal scalability.  That project was put on hold.  We are now moving
forward with an operational system and are having scaling problems.



During the prior testing (3 years ago) we were using a much older version
of Cassandra (0.8 or older), the Thrift API, and Amazon AWS rather than
OpenStack VMs.  We are now using the latest Cassandra and the CQL
interface.  We did try moving from OpenStack to AWS/EC2, but that did not
materially change our (poor) results.



Test Procedure:

   - Inserted 54 million cells in 18 million rows (so 3 cells per row),
   using randomly generated row keys.  This served as the data control for
   the test.
   - Spawned a client on a different VM to query 100k rows, and repeated
   that for 100 reps.  Each row key queried is drawn randomly from the set
   of existing row keys and then not re-used, so all 10 million row queries
   use a different (valid) row key.  This test reflects a specific use case
   of our system that we are trying to show will scale.
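The key-selection part of that procedure can be sketched roughly as below.
This is a minimal, hypothetical illustration (the function and variable
names are made up, and the actual CQL query per key is shown only as a
comment), just to make the "draw randomly, never re-use" behavior concrete:

```python
import random

def draw_keys_without_reuse(all_keys, batch_size, reps):
    """Yield `reps` batches of row keys, drawn randomly and never re-used."""
    pool = list(all_keys)
    random.shuffle(pool)  # random order over the full set of existing keys
    for rep in range(reps):
        start = rep * batch_size
        batch = pool[start:start + batch_size]
        # in the real client, each key in `batch` would be queried once, e.g.
        #   session.execute("SELECT * FROM t WHERE key = %s", (k,))
        yield batch

# Small-scale illustration; the real test uses batch_size=100_000, reps=100.
keys = ["row-%d" % i for i in range(1000)]
batches = list(draw_keys_without_reuse(keys, batch_size=10, reps=100))
seen = [k for b in batches for k in b]
assert len(seen) == len(set(seen)) == 1000  # every key used exactly once
```

The point is only that no row key repeats across the 10 million queries, so
no read should be served purely from a hot row cache.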

Result:

   - 2 nodes performed better than 1 node, but 4 nodes showed decreased
   performance relative to 2 nodes.  So the test did not show horizontal
   scaling.



Notes:

   - We have the replication factor set to 1, to keep the control test as
   simple as possible while proving out horizontal scaling.
   - When we tried adding threading to see if it would help, it produced
   interesting side behavior but still did not demonstrate horizontal
   scaling.
   - We are using CQL rather than the Thrift API, on Cassandra 2.0.6.
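For context, the keyspace for the control test is defined with RF=1 along
these lines.  The keyspace, table, and column names below are illustrative
placeholders, not our actual schema:

```
-- illustrative names only, not our actual schema
CREATE KEYSPACE scale_test
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE scale_test.data (
  key text PRIMARY KEY,
  c1 text, c2 text, c3 text   -- 3 cells per row, matching the insert phase
);
```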





Does anyone have feedback on whether threading or a higher replication
factor is necessary to show horizontal scaling in Cassandra, versus the
minimal approach of simply continuing to add nodes to increase throughput?



Any suggestions on the minimal configuration necessary to show scaling for
our query use case: 100k requests for random, non-repeating keys arriving
continuously over a period of time?


Thanks,

Diane
