Re: horizontal query scaling issues follow on

Jack Krupansky Thu, 17 Jul 2014 21:20:08 -0700

Sorry, I may have confused the discussion by mentioning tokens – I wasn’t 
referring to vnodes or the num_tokens property, merely to the token range of a 
node and the fact that the partition key hashes to a token value.

The main question is what you use for your primary key: are you using a small 
number of partition keys with a large number of clustering columns, or does 
each row have a unique partition key and no clustering columns?
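
To make the two layouts concrete, here is a minimal Python sketch (a toy model, 
not Cassandra code) that treats a primary key as a (partition key, clustering 
key) pair and counts the partitions each layout produces. The key names and row 
counts are made up for illustration.

```python
from collections import defaultdict

def group_by_partition(rows):
    """Group (partition_key, clustering_key, value) rows by partition key."""
    partitions = defaultdict(dict)
    for partition_key, clustering_key, value in rows:
        partitions[partition_key][clustering_key] = value
    return partitions

# Layout A (hypothetical): few partition keys, many clustering values each.
wide = [("sensor-%d" % (i % 2), i, "v%d" % i) for i in range(10)]

# Layout B (hypothetical): every row has its own partition key and no
# clustering column.
narrow = [("row-%d" % i, None, "v%d" % i) for i in range(10)]

wide_parts = group_by_partition(wide)
narrow_parts = group_by_partition(narrow)

print(len(wide_parts), "partitions in layout A")
print(len(narrow_parts), "partitions in layout B")
```

Layout A ends up with 2 partitions of 5 rows each, while layout B ends up with 
10 single-row partitions. Only the partition key decides which node holds a row.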

-- Jack Krupansky

From: Diane Griffith 
Sent: Thursday, July 17, 2014 6:21 PM
To: user 
Subject: Re: horizontal query scaling issues follow on

So do partitions equate to tokens/vnodes? 

If so, we configured all cluster nodes/VMs with num_tokens: 256 instead of 
setting initial_token and assigning ranges.  I still don't see why, in 
Cassandra 2.0, I would assign my own ranges via initial_token. Our choice was 
based on the documentation and even this blog item, which made it seem right 
for us to always configure our cluster VMs with num_tokens: 256 in the 
cassandra.yaml file.
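
As a rough illustration of the two options, the sketch below compares one 
manually assigned, evenly spaced token per node with 256 randomly chosen tokens 
per node. It is a simulation under stated assumptions (a 64-bit token ring, four 
hypothetical node names, a seeded random generator), not Cassandra's own 
allocation code.

```python
import random

RING_SIZE = 2 ** 64          # Murmur3Partitioner tokens run from -2**63 to 2**63 - 1
MIN_TOKEN = -(2 ** 63)

def ownership(tokens_by_node):
    """Return the fraction of the token ring owned by each node.

    Each token owns the range from the previous token on the ring
    (exclusive) up to itself (inclusive), wrapping around at the end.
    Assumes at least two distinct tokens in total.
    """
    ring = sorted((t, n) for n, tokens in tokens_by_node.items() for t in tokens)
    owned = {n: 0 for n in tokens_by_node}
    for i, (token, node) in enumerate(ring):
        previous = ring[i - 1][0]            # index -1 wraps to the last token
        owned[node] += (token - previous) % RING_SIZE
    return {n: size / RING_SIZE for n, size in owned.items()}

nodes = ["node1", "node2", "node3", "node4"]   # hypothetical node names

# Manual assignment: one evenly spaced token per node.
manual = {n: [MIN_TOKEN + i * (RING_SIZE // len(nodes))] for i, n in enumerate(nodes)}

# vnode-style assignment: 256 random tokens per node (seeded for repeatability).
rng = random.Random(42)
vnodes = {n: [rng.randrange(MIN_TOKEN, -MIN_TOKEN) for _ in range(256)] for n in nodes}

manual_share = ownership(manual)
vnode_share = ownership(vnodes)
print(manual_share)
print(vnode_share)
```

With evenly spaced manual tokens each node owns exactly a quarter of the ring. 
With 256 random tokens per node the shares come out close to a quarter each, 
without anyone having to calculate ranges by hand.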

Also in all testing, all vms were of equal sizing so one was not more powerful 
than another.  

I didn't think I was hitting an I/O wall on the client VM (a separate VM), 
where we scripted our query calls to the Cassandra cluster from the command 
line.  I can split the client call load across VMs, which I tried early on.  
Happy to verify that again though.

So, given all that, I was assuming the partitioning was not a problem.  Is that 
an incorrect assumption and something to dig into more?

Thanks,
Diane

On Thu, Jul 17, 2014 at 3:01 PM, Jack Krupansky <j...@basetechnology.com> wrote:

  How many partitions are you spreading those 18 million rows over? That many 
rows in a single partition will not be a sweet spot for Cassandra. It’s not 
exceeding any hard limit (2 billion cells per partition), but some internal 
operations may cache the partition rather than the logical row.

  And all those rows in a single partition would certainly not be a test of 
“horizontal scaling” (adding nodes to handle more data – more token values or 
partitions.)
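
A minimal sketch of why this matters: rows are placed by hashing the partition 
key to a token, and the node owning that token's range stores the row. The code 
below uses SHA-256 as a stand-in hash (Cassandra's default partitioner uses 
Murmur3), four hypothetical nodes with evenly spaced tokens, and 1,000 rows in 
place of 18 million.

```python
import bisect
import hashlib
from collections import Counter

def token_for(partition_key):
    """Stand-in partitioner: hash a key to a signed 64-bit token (not Murmur3)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def owner(token, ring):
    """Return the node owning `token`; `ring` is a sorted list of (token, node).

    The first ring token greater than or equal to `token` owns it, and
    tokens past the last ring entry wrap around to the first.
    """
    idx = bisect.bisect_left(ring, (token,))
    return ring[idx % len(ring)][1]

# Four hypothetical nodes with evenly spaced tokens on the 64-bit ring.
ring = sorted((-(2 ** 63) + i * 2 ** 62, "node%d" % (i + 1)) for i in range(4))

# Case 1: every row shares one partition key, so all rows hash to one token.
single = Counter(owner(token_for("all-data"), ring) for _ in range(1000))

# Case 2: every row has its own partition key.
spread = Counter(owner(token_for("row-%d" % i), ring) for i in range(1000))

print("single partition:", dict(single))
print("unique partition keys:", dict(spread))
```

In the first case one node holds all 1,000 rows and the other three hold 
nothing, so adding nodes cannot help. In the second case the rows spread over 
all four nodes.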

  -- Jack Krupansky

  From: Diane Griffith 
  Sent: Thursday, July 17, 2014 1:33 PM
  To: user 
  Subject: horizontal query scaling issues follow on

  This is a follow on re-post to clarify what we are trying to do, providing 
information that was missing or not clear.

  Goal:  Verify horizontal scaling for random non duplicating key reads using 
the simplest configuration (or minimal configuration) possible.

  Background:

  A couple years ago we did similar performance testing with Cassandra for both 
read and write performance and found excellent (essentially linear) horizontal 
scalability.  That project got put on hold.  We are now moving forward with an 
operational system and are having scaling problems.

  During the prior testing (3 years ago) we were using a much older version of 
Cassandra (0.8 or older), the THRIFT API, and Amazon AWS rather than OpenStack 
VMs.  We are now using the latest Cassandra and the CQL interface.  We did try 
moving from OpenStack to AWS/EC2 but that did not materially change our (poor) 
results.

  Test Procedure:

    a. Inserted 54 million cells in 18 million rows (so 3 cells per row), 
using randomly generated row keys. That was to be our data control for the 
test. 
    b. Spawned a client on a different VM to query 100k rows, repeated for 100 
reps.  Each row key queried is drawn randomly from the set of existing row 
keys, and then not re-used, so all 10 million row queries use a different 
(valid) row key.  This test is a specific use case of our system that we are 
trying to show will scale.
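
The key selection in step b can be sketched as sampling without replacement. 
This is an illustrative Python version, not the script used in the test; the 
function name `query_batches` and the scaled-down sizes are assumptions.

```python
import random

def query_batches(row_keys, batch_size, reps, seed=None):
    """Yield `reps` batches of `batch_size` keys, with no key used twice."""
    needed = batch_size * reps
    if needed > len(row_keys):
        raise ValueError("not enough distinct keys for the requested queries")
    rng = random.Random(seed)
    chosen = rng.sample(row_keys, needed)   # sampling without replacement
    for i in range(reps):
        yield chosen[i * batch_size:(i + 1) * batch_size]

# Scaled-down stand-in for 18 million keys, 100k queries per rep, 100 reps.
keys = ["key-%d" % i for i in range(1800)]
batches = list(query_batches(keys, batch_size=10, reps=100, seed=1))

print(len(batches), "batches of", len(batches[0]), "keys")
```
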
  Result:

    a. The 2-node test performed better than the 1-node test, but 4 nodes 
showed decreased performance compared with 2 nodes, so the results did not show 
horizontal scaling.

  Notes:

    a. We have the replication factor set to 1, as we were trying to keep the 
control test simple to prove out horizontal scaling.  
    b. When we tried adding threading to see if it would help, it had 
interesting side behavior which did not prove out horizontal scaling. 
    c. We are using CQL rather than the Thrift API, on Cassandra 2.0.6. 

  Does anyone have any feedback on whether threading or a higher replication 
factor is necessary to show horizontal scaling of Cassandra, as opposed to the 
minimal approach of just continuing to add nodes to help throughput?
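
For context on the threading question: a client that issues one synchronous 
read at a time has only one request in flight, so its throughput is bounded by 
per-request latency however many nodes are added. The sketch below shows one 
way to keep several reads in flight using Python's standard thread pool. The 
`fetch_row` function and the in-memory table are hypothetical placeholders for 
a real driver call, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the cluster; a real test would call
# a Cassandra driver inside fetch_row instead.
FAKE_TABLE = {"key-%d" % i: ("a", "b", "c") for i in range(1000)}

def fetch_row(key):
    """Placeholder for one synchronous single-row read."""
    return FAKE_TABLE[key]

def run_queries(keys, workers):
    """Issue the reads from `workers` threads so several are in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_row, keys))

rows = run_queries(list(FAKE_TABLE), workers=8)
print(len(rows), "rows fetched")
```
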

  Any suggestions for the minimal configuration necessary to show scaling of 
our query use case, that is, 100k requests for random, non-repeating keys 
constantly coming in over a period of time?

  Thanks,

  Diane
