Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-08 Thread 孔嘉林
Thanks Chris.
I run the client on a separate AWS instance from the Cassandra cluster
servers. On the client side, I create 40 or 50 threads for sending requests
to each Cassandra node, and I create one Thrift client for each of the
threads. At the beginning, all the created Thrift clients connect to their
corresponding Cassandra nodes and stay connected for the whole run (I do not
close the transports until the end of the test). So I use very simple load
balancing, since the same number of Thrift clients connects to each node.
My source code is here:
https://github.com/kongjialin/Cassandra/blob/master/cassandra_client.cpp
It's very kind of you to help me improve my code.
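
The core of each worker thread is roughly like the sketch below (a simplified
illustration of the pattern described above, not the exact code in the
repository; the header and class names are assumptions based on the
Thrift-generated Cassandra C++ bindings, and the host/keyspace/table names are
placeholders):

    // One persistent Thrift connection per thread, synchronous
    // execute_cql3_query calls in a loop, per-request timing.
    #include <thrift/transport/TSocket.h>
    #include <thrift/transport/TBufferTransports.h>
    #include <thrift/protocol/TBinaryProtocol.h>
    #include <boost/shared_ptr.hpp>
    #include <chrono>
    #include <thread>
    #include <vector>
    #include "Cassandra.h"   // generated from cassandra.thrift

    using namespace apache::thrift::transport;
    using namespace apache::thrift::protocol;
    using namespace org::apache::cassandra;

    void worker(const std::string& host, int requests) {
      boost::shared_ptr<TSocket> socket(new TSocket(host, 9160));
      boost::shared_ptr<TTransport> transport(new TFramedTransport(socket));
      boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
      CassandraClient client(protocol);
      transport->open();                       // connect once, reuse for every request
      client.set_keyspace("test_ks");          // placeholder keyspace
      for (int i = 0; i < requests; ++i) {
        CqlResult result;
        auto start = std::chrono::steady_clock::now();
        client.execute_cql3_query(result,
            "INSERT INTO t (id1, ts, id2, msg) VALUES ('a', 'b', 'c', 'd')",
            Compression::NONE, ConsistencyLevel::ONE);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        (void)us;                              // record into a per-thread latency histogram
      }
      transport->close();
    }

    int main() {
      std::vector<std::thread> threads;
      for (int t = 0; t < 40; ++t)             // 40 threads against one node
        threads.emplace_back(worker, "10.0.0.1", 25000);
      for (auto& th : threads) th.join();
      return 0;
    }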

As I increase the number of threads, the latency gets longer.

I'm using C++, so if I want to use the native binary protocol + prepared
statements, is the only way to use the C++ driver?
Thanks very much.





Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-08 Thread Chris Lohfink
So I would *expect* an increase of ~20k QPS per node with m3.xlarge, so there
may be something up with your client (I am not a C++ person, however, so
hopefully someone on the list will take notice).

Latency does not decrease linearly as you add nodes.  What you are likely
seeing with latency, since you have so few nodes, is a side effect of an
optimization.  When you read from or write to a table, the node you send the
request to acts as the coordinator.  If the data exists on the coordinator,
and you are using rf=1 or cl=1, it will not have to send the request to
another node; it just services it locally:

  +---------+        +-------------+
  |  node0  |        |    node1    |
  |---------|  --->  |-------------|
  |  client |        | coordinator |
  +---------+        +-------------+

In this case the write latency is dominated by the network between
coordinator and client.  A second case is where the coordinator actually
has to send the request to another node:

  +---------+        +-------------+        +--------------+
  |  node0  |        |    node1    |        |    node2     |
  |---------|  --->  |-------------|  --->  |--------------|
  |  client |        | coordinator |        | data replica |
  +---------+        +-------------+        +--------------+

As you add nodes, you increase the probability of hitting this second
scenario, where the coordinator has to make an additional network hop.  This
is possibly why you are seeing an increase (aside from client issues).  To get
an idea of how latency is affected as you increase nodes, you really need to
go higher than 4 (i.e. graph the same rf for 5, 10, 15, 25 nodes; below 5
nodes isn't really the recommended way to run Cassandra anyway), since the
latency will approach that of the second scenario (plus some spike outliers
for GCs) and then it should settle down until you overwork the nodes.
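
As a rough back-of-the-envelope (my own illustration, assuming even token
ownership and coordinators chosen uniformly, which your round-robin client
approximates), the chance that the coordinator is already a replica is about
rf/N, so the fraction of requests paying the extra hop grows as you add nodes:

    // Illustrative only: fraction of requests needing the extra
    // coordinator -> replica hop, assuming even ownership and
    // uniformly chosen coordinators.
    #include <algorithm>
    #include <cstdio>

    int main() {
      const double rf = 1.0;                        // replication factor
      for (int n : {1, 2, 4, 5, 10, 15, 25}) {      // cluster sizes
        double p_local = std::min(1.0, rf / n);     // coordinator already owns the data
        std::printf("N=%2d rf=%.0f  local=%3.0f%%  extra hop=%3.0f%%\n",
                    n, rf, 100 * p_local, 100 * (1 - p_local));
      }
      return 0;
    }

With rf=1 that is 0% extra hops on one node, 75% on four nodes, and it keeps
climbing toward 100% as the cluster grows, which is why the latency curve
flattens out near the second scenario.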

You may want to give https://github.com/datastax/cpp-driver a go (not a C++
guy, so take with a grain of salt).  I would still highly recommend using
cassandra-stress instead of your own stuff if you want to test Cassandra and
not your code.
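
Something along these lines with the cpp-driver would get you the native
protocol plus prepared statements (untested on my end; the cass_* signatures
below follow a later 2.x release of that driver than what was current at the
time, and the host/keyspace/table names are placeholders, so treat it purely
as a sketch):

    /* Sketch: prepare once, then bind and execute asynchronously. */
    #include <cassandra.h>
    #include <vector>

    int main() {
      CassCluster* cluster = cass_cluster_new();
      CassSession* session = cass_session_new();
      cass_cluster_set_contact_points(cluster, "10.0.0.1");   // placeholder host

      CassFuture* connect = cass_session_connect(session, cluster);
      cass_future_wait(connect);
      cass_future_free(connect);

      CassFuture* prep = cass_session_prepare(session,
          "INSERT INTO test_ks.t (id1, ts, id2, msg) VALUES (?, ?, ?, ?)");
      cass_future_wait(prep);
      const CassPrepared* prepared = cass_future_get_prepared(prep);
      cass_future_free(prep);

      // Issue a group of asynchronous writes, then wait on all the futures.
      std::vector<CassFuture*> inflight;
      for (int i = 0; i < 100; ++i) {
        CassStatement* stmt = cass_prepared_bind(prepared);
        cass_statement_bind_string(stmt, 0, "id1-value");
        cass_statement_bind_string(stmt, 1, "ts-value");
        cass_statement_bind_string(stmt, 2, "id2-value");
        cass_statement_bind_string(stmt, 3, "msg-value");
        inflight.push_back(cass_session_execute(session, stmt));
        cass_statement_free(stmt);
      }
      for (CassFuture* f : inflight) {
        cass_future_wait(f);            // real code should check cass_future_error_code(f)
        cass_future_free(f);
      }

      cass_prepared_free(prepared);
      cass_session_free(session);
      cass_cluster_free(cluster);
      return 0;
    }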

===
Chris Lohfink


Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-08 Thread Eric Stevens
 Do you mean that I should clear my table after each run? Indeed, I can see
compaction happen several times during my test, but could just a few
compactions affect the performance that much?

It certainly affects performance.  Read performance suffers first, then write
performance suffers eventually.  For this synthetic test, if you want to
compare like states, then you should certainly wipe between runs.  You may
fall behind on compaction during the first run, and then the second run pays
the penalty for the data grooming backlog generated during the first run.

 As for latency, which latency should I care about most? p(99) or p(999)?

p(99) discards the worst 1% of results for reporting; p(999) discards the
worst 0.1% of results for reporting.  Which you prefer depends on your
tolerance for response time jitter, i.e. do you need 99% of responses to be
under a threshold, or 99.9%?  The more 9's, the more likely you are to fail
your threshold due to an outlier.
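
If it helps, computing these from your client-side samples is just a matter of
sorting and indexing.  A quick illustrative snippet (not from my test suite):

    // Illustrative percentile calculation over client-side latency samples (ms).
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // p in [0, 1), e.g. 0.99 for p(99) or 0.999 for p(999).
    double percentile(std::vector<double> samples, double p) {
      std::sort(samples.begin(), samples.end());
      size_t idx = static_cast<size_t>(p * samples.size());
      return samples[std::min(idx, samples.size() - 1)];
    }

    int main() {
      std::vector<double> latencies;
      for (int i = 0; i < 1000; ++i)
        latencies.push_back(2.0 + (i % 10) * 0.1);   // ~1000 "normal" samples, 2.0-2.9 ms
      latencies.push_back(250.0);                    // one ill-timed GC pause / disk stall
      std::printf("p99  = %.2f ms\n", percentile(latencies, 0.99));
      std::printf("p999 = %.2f ms\n", percentile(latencies, 0.999));
      std::printf("max  = %.2f ms\n",
                  *std::max_element(latencies.begin(), latencies.end()));
      return 0;
    }

The single 250 ms outlier dominates the max, while p(99) and p(999) stay near
the typical 2-3 ms values, which is exactly why a max-latency target is so
fragile.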

 So, how did you test your cluster that can get 86k writes/sec? How many
requests did you send to your cluster?

I wrote the same data to each of 5 tables with similar columns, but
different key configurations.  I did 100 runs of 5,000 records (different
records for each run).  The data itself was 5 columns composed of a mix of
bigint, text, and timestamp (so per record, fairly small data).  I wrote
records in asynchronous batches of 100 at a time, completing each of the
5,000 records for one table before moving on to the next table (the last
write to table 1 needed to complete before I moved on to the first write of
table 2, but within a table the operations were done in parallel).

I used the Datastax Java Driver, which speaks the native protocol, and is
faster and supports more parallelism than Thrift.

 Was it also 1 million?

In total it was 500,000 records written to each of 5 tables - so 2.5
million records overall.

 Did you also use OpsCenter to monitor the real time performance? I also
wonder why the write and read QPS OpsCenter provide are much lower than
what I calculate.

No, I measured throughput on my client only.  I don't have much experience
with OpsCenter, so I'm afraid I can't give you much insight into why you'd
see inconsistent information compared to data you measured.  Maybe you're
just seeing information for a single node instead of the whole cluster?

Again, the validity of this kind of test is highly suspect even though I
happened to have set this up already.  In my case I was trying to measure
burst performance specifically.  Cassandra will definitely accept bursts
well, but if you sustain such a load, performance will degrade over time.
Under sustained conditions you need to be certain you are staying on top of
compaction - outstanding compaction tasks should rarely if ever exceed 2 or
3.  Above 10, you need to reduce your write volume or your cluster will
gradually fall over, and you'll struggle to bootstrap new nodes to expand.

Do not size Cassandra for burst writes, size it for sustained writes.
Write your sizing tests with that in mind - how much can you write and not
fall behind on compaction over time, and accordingly your tests need to run
for hours or days, not seconds or minutes.


Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
Hi Joy,

Are you resetting your data after each test run?  I wonder if your tests
are actually causing you to fall behind on data grooming tasks such as
compaction, and so performance suffers for your later tests.

There are *so many* factors which can affect performance that, without
reviewing the test methodology in great detail, it's really hard to say
whether there are flaws which might uncover an antipattern, cause an atypical
number of cache hits or misses, and so forth. You may also be producing GC
pressure in the write path.

I *can* say that 28k writes per second looks just a little low, but it depends
a lot on your network, hardware, and write patterns (e.g. data size).  For a
little performance test suite I wrote, with parallel batched writes on a
3-node rf=3 test cluster, I got about 86k writes per second.

Also, focusing exclusively on max latency is going to cause you some trouble,
especially in the case of magnetic media as you're using.  Between ill-timed
GC and the inconsistent performance characteristics of magnetic media, your
max numbers will often look significantly worse than your p(99) or p(999)
numbers.

All this said, one node will often look better than several nodes for certain
patterns because it completely eliminates proxy (coordinator) write times.
All writes are local writes.  It's an over-simple case that doesn't reflect
any practical production use of Cassandra, so it's probably not worth even
including in your tests.  I would recommend starting at 3 nodes rf=3 and
comparing against 6 nodes rf=6.  Make sure you're staying on top of compaction
and aren't seeing garbage collections in the logs (either of those will
pollute your results with variability you can't account for with small sample
sizes of ~1 million).

If you expect to sustain write volumes like this, you'll find these
clusters are sized too small (on that hardware you won't keep up with
compaction), and your tests are again testing scenarios you wouldn't
actually see in production.

On Sat Dec 06 2014 at 7:09:18 AM kong kongjiali...@gmail.com wrote:

 Hi,

 I am doing a stress test on Datastax Cassandra Community 2.1.2, not using
 the provided stress test tool, but my own stress-test client code
 instead (I wrote some C++ stress test code). My Cassandra cluster is
 deployed on Amazon EC2, using the Datastax Community AMI (HVM instances)
 from the Datastax documentation, and I am not using EBS, just the ephemeral
 storage by default. The EC2 instance type of the Cassandra servers is
 m3.xlarge. I use another EC2 instance for my stress test client, which is
 of type r3.8xlarge. Both the Cassandra server nodes and the stress test
 client node are in us-east. I test Cassandra clusters made up of 1 node, 2
 nodes, and 4 nodes separately. I run the INSERT test and the SELECT test
 separately, but performance does not increase linearly when new nodes are
 added. I also get some weird results. My test results are as follows (*I do
 1 million operations and I try to get the best QPS while keeping the max
 latency no more than 200 ms; the latencies are measured from the client
 side. QPS is calculated as total_operations/total_time*).



 *INSERT(write):*

 Node count | Repl. factor |  QPS  | Avg lat(ms) | Min lat(ms) | .95 lat(ms) | .99 lat(ms) | .999 lat(ms) | Max lat(ms)
 -----------+--------------+-------+-------------+-------------+-------------+-------------+--------------+------------
     1      |      1       | 18687 |    2.08     |    1.48     |    2.95     |    5.74     |     52.8     |    205.4
     2      |      1       | 20793 |    3.15     |    0.84     |    7.71     |    41.35    |     88.7     |    232.7
     2      |      2       | 22498 |    3.37     |    0.86     |    6.04     |    36.1     |    221.5     |    649.3
     4      |      1       | 28348 |    4.38     |    0.85     |    8.19     |    64.51    |    169.4     |    251.9
     4      |      3       | 28631 |    5.22     |    0.87     |    18.68    |    68.35    |    167.2     |    288



 *SELECT(read):*

 Node count | Repl. factor |  QPS  | Avg lat(ms) | Min lat(ms) | .95 lat(ms) | .99 lat(ms) | .999 lat(ms) | Max lat(ms)
 -----------+--------------+-------+-------------+-------------+-------------+-------------+--------------+------------
     1      |      1       | 24498 |    4.01     |    1.51     |    7.6      |    12.51    |     31.5     |    129.6
     2      |      1       | 28219 |    3.38     |    0.85     |    9.5      |    17.71    |     39.2     |    152.2
     2      |      2       | 35383 |    4.06     |    0.87     |    9.71     |    21.25    |     70.3     |    215.9
     4      |      1       | 34648 |    2.78     |    0.86     |    6.07     |    14.94    |     30.8     |    134.6
     4      |      3       | 52932 |    3.45     |    0.86     |    10.81    |    21.05    |     37.4     |    189.1



 The test data I use is generated randomly, and the schema I use is as
 follows (I use cqlsh to create the column family/table):

 CREATE TABLE table(
     id1  varchar,
     ts   varchar,
     id2  varchar,
     msg  varchar,
     PRIMARY KEY(id1, ts, id2));

 So the fields are all strings, and I generate each character of the strings
 randomly, using srand(time(0)) and rand() in C++, so I think my test data
 should be uniformly distributed across the Cassandra cluster. And, in my
 client stress test code, I use the Thrift C++ interface, and the basic
 operations I do are like:

 thrift_client.execute_cql3_query("INSERT INTO table (id1, ts, id2, msg)
 VALUES ('xxx', 'xxx', 'xxx', 'xxx')"); and
 thrift_client.execute_cql3_query("SELECT * FROM table WHERE id1='xxx'");

 Each data entry I INSERT or SELECT is around 100 characters.

 On my stress test client, I create several threads to send 

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
I'm sorry, I meant to say 6 nodes rf=3.

Also look at this performance over sustained periods of time, not burst
writing.  Run your test for several hours and watch memory and especially
compaction stats.  See if you can dial in the data volume you can write while
keeping outstanding compaction tasks < 5 (preferably 0 or 1) for sustained
periods.  Measuring just burst writes will definitely mask real world
conditions, and Cassandra actually absorbs bursted writes really well (which
in turn masks performance problems, since by the time your write times suffer
from overwhelming a cluster, you're probably already in insane and
difficult-to-recover crisis mode).


Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread 孔嘉林
Hi Eric,
Thank you very much for your reply!
Do you mean that I should clear my table after each run? Indeed, I can see
compaction happen several times during my test, but could just a few
compactions affect the performance that much? Also, I can see from OpsCenter
that some ParNew GCs happen but no CMS GCs happen.

I run my test on an EC2 cluster, so I think the network within it should be
high speed. Each Cassandra server has 4 vCPUs, 15 GiB of memory and 80 GB of
SSD storage, which is the m3.xlarge type.

As for latency, which latency should I care about most, p(99) or p(999)? I
want to get the max QPS under a certain latency limit.

I know my testing scenarios are not the common case in production; I just
want to know how much load my cluster can bear under stress.

So, how did you test your cluster to get 86k writes/sec? How many requests
did you send to your cluster? Was it also 1 million? Did you also use
OpsCenter to monitor the real-time performance? I also wonder why the write
and read QPS that OpsCenter reports are much lower than what I calculate.
Could you please describe your test deployment in detail?

Thank you very much,
Joy


Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Chris Lohfink
I think your client could use improvements.  How many threads do you have
running in your test?  With a thrift call like that you can only do one
request at a time per connection.  For example, assuming C* takes 0 ms, a
10 ms network latency/driver overhead will mean a 20 ms RTT and a max
throughput of ~50 QPS per thread (the native binary protocol doesn't behave
like this).  Are you running the client on its own system or shared with a
node?  How are you load balancing your requests?  Source code would help,
since there's a lot that can become a bottleneck.
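
To make the arithmetic concrete (my own illustrative numbers, not
measurements): a synchronous client's ceiling is just 1000 / RTT_ms requests
per second per thread, times the number of threads:

    // Illustrative ceiling for a synchronous one-request-at-a-time client.
    #include <cstdio>

    int main() {
      const double rtt_ms = 20.0;                // e.g. 10 ms each way, C* itself at 0 ms
      const int threads_per_node = 40, nodes = 4;
      double per_thread_qps = 1000.0 / rtt_ms;   // ~50 QPS per thread at 20 ms RTT
      double client_ceiling = per_thread_qps * threads_per_node * nodes;
      std::printf("per-thread ceiling: %.0f QPS, whole-client ceiling: %.0f QPS\n",
                  per_thread_qps, client_ceiling);
      // With a ~1 ms RTT (typical intra-region EC2) the same math gives ~1000 QPS
      // per thread, so the real ceiling depends entirely on your measured RTT.
      return 0;
    }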

Generally you will see a bit of a dip in latency between N=RF=1 and N=2, RF=2,
etc., since there are optimizations on the coordinator node when it doesn't
need to send the request to the replicas.  The impact of the network overhead
decreases in significance as the cluster grows.  Typically, latency-wise,
RF=N=1 is going to be the fastest possible for smaller loads (i.e. when a
client cannot fully saturate a single node).

The main thing to expect is that latency will plateau and remain fairly
constant as load/nodes increase, while throughput potential will increase
(empirically at least) linearly.

You should really attempt it with the native binary protocol + prepared
statements; running CQL over Thrift is far from optimal.  I would recommend
using the cassandra-stress tool if you want to stress test Cassandra (and not
your code):
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema

===
Chris Lohfink
