Could ring cache really improve performance in Cassandra?

2014-12-07 Thread kong
Hi, 

I'm doing stress tests on Cassandra. I have learned that using a ring cache
can improve performance, because client requests can go directly to the
target Cassandra server, so the coordinator node is itself the desired
target node. This way the coordinator does not need to route client
requests to the target node, and perhaps we can get close to linear
performance scaling.

 

However, in my stress test on an Amazon EC2 cluster, the results are weird.
It seems there is no performance improvement after using the ring cache.
Could anyone help me explain these results? (I also think the results of
the test without the ring cache are weird, because QPS does not increase
linearly when new nodes are added. I need help explaining this, too.) The
results are as follows:

 

INSERT(write):


Node count   Replication factor   QPS (no ring cache)   QPS (ring cache)
1            1                    18687                 20195
2            1                    20793                 26403
2            2                    22498                 21263
4            1                    28348                 30010
4            3                    28631                 24413

 

SELECT(read):


Node count   Replication factor   QPS (no ring cache)   QPS (ring cache)
1            1                    24498                 22802
2            1                    28219                 27030
2            2                    35383                 36674
4            1                    34648                 28347
4            3                    52932                 52590

 

 

Thank you very much,

Joy



Re: How to model data to achieve specific data locality

2014-12-07 Thread DuyHai Doan
Those sequences are not fixed. All sequences with the same seq_id tend to
grow at the same rate. If it's one partition per seq_id, the size will most
likely exceed the threshold quickly

-- Then use bucketing to avoid too wide partitions

Also new seq_types can be added and old seq_types can be deleted. This
means I often need to ALTER TABLE to add and drop columns. I am not sure if
this is a good practice from an operational point of view.

 -- I don't understand why altering the table is necessary to add
seq_types. If seq_type is defined as your clustering column, you can have
many of them using the same table structure ...
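
A minimal CQL sketch of that shape, with a bucket in the partition key and
seq_type as the clustering column (the table and column names here are
illustrative, not from the original thread):

CREATE TABLE sequences (
    seq_id   text,
    bucket   int,
    seq_type text,
    payload  blob,
    PRIMARY KEY ((seq_id, bucket), seq_type)
);

-- A new seq_type is just a new row; no ALTER TABLE is needed:
INSERT INTO sequences (seq_id, bucket, seq_type, payload)
VALUES ('seq42', 0, 'type_x', 0xcafe);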





On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang dep...@gmail.com wrote:

 On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens migh...@gmail.com wrote:

 It depends on the size of your data, but if your data is reasonably
 small, there should be no trouble including thousands of records on the
 same partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
 ought to work fine.

 If the data size per partition exceeds some threshold that represents the
 right tradeoff of increasing repair cost, gc pressure, threatening
 unbalanced loads, and other issues that come with wide partitions, then you
 can subpartition via some means in a manner consistent with your work load,
 with something like PRIMARY KEY ((seq_id, subpartition), seq_type).

 For example, if seq_type can be processed for a given seq_id in any
 order, and you need to be able to locate specific records for a known
 seq_id/seq_type pair, you can compute subpartition deterministically.
 Or if you only ever need to read *all* values for a given seq_id, and the
 processing order is not important, just randomly generate a value for
 subpartition at write time, as long as you know all possible values for
 subpartition.

 If the values for the seq_types for a given seq_id must always be
 processed in order based on seq_type, then your subpartition calculation
 would need to reflect that and place adjacent seq_types in the same
 partition.  As a contrived example, say seq_type was an incrementing
 integer, your subpartition could be seq_type / 100.

 On Fri Dec 05 2014 at 7:34:38 PM Kai Wang dep...@gmail.com wrote:

 I have a data model question. I am trying to figure out how to model the
 data to achieve the best data locality for analytic purpose. Our
 application processes sequences. Each sequence has a unique key in the
 format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
 number of seq_types. The typical read is to load a subset of sequences with
 the same seq_id. Naturally I would like to have all the sequences with the
 same seq_id to co-locate on the same node(s).


 However I can't simply create one partition per seq_id and use seq_id as
 my partition key. That's because:


 1. there could be thousands or even more seq_types for each seq_id. It's
 not feasible to include all the seq_types into one table.

 2. each seq_id might have different sets of seq_types.

 3. each application only needs to access a subset of seq_types for a
 seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
 prefer only touching the data that's needed.


 As per the above, I think I should use one partition per
 [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One
 possible approach is to override IPartitioner so that I just use part of
 the field (say 64 bytes) to get the token (for location) while still using
 the whole field as partition key (for look up). But before heading that
 direction, I would like to see if there are better options out there. Maybe
 any new or upcoming features in C* 3.0?


 Thanks.


 Thanks, Eric.

 Those sequences are not fixed. All sequences with the same seq_id tend to
 grow at the same rate. If it's one partition per seq_id, the size will most
 likely exceed the threshold quickly. Also new seq_types can be added and
 old seq_types can be deleted. This means I often need to ALTER TABLE to add
 and drop columns. I am not sure if this is a good practice from operation
 point of view.

 I thought about your subpartition idea. If there are only a few
 applications and each one of them uses a subset of seq_types, I can easily
 create one table per application since I can compute the subpartition
 deterministically as you said. But in my case data scientists need to
 easily write new applications using any combination of seq_types of a
 seq_id. So I want the data model to be flexible enough to support
 applications using any different set of seq_types without creating new
 tables, duplicate all the data etc.

 -Kai





Re: How to model data to achieve specific data locality

2014-12-07 Thread Eric Stevens
 Also new seq_types can be added and old seq_types can be deleted. This
means I often need to ALTER TABLE to add and drop columns.

Kai, unless I'm misunderstanding something, I don't see why you need to
alter the table to add a new seq type.  From a data model perspective,
these are just new values in a row.

If you do have columns which are specific to particular seq_types, data
modeling does become a little more challenging.  In that case you may get
some advantage from using collections (especially map) to store data which
applies to only a few seq types.  Or defining a schema which includes the
set of all possible columns (that's when you're getting into ALTERs when a
new column comes or goes).

 All sequences with the same seq_id tend to grow at the same rate.

Note that it is an anti-pattern in Cassandra to append to the same row
indefinitely.  I think you understand this because of your original
question.  But please note that a sub partitioning strategy which reuses
subpartitions will result in degraded read performance after a while.
You'll need to rotate sub partitions by something that doesn't repeat in
order to keep the data for a given partition key grouped into just a few
sstables.  A typical pattern there is to use some kind of time bucket
(hour, day, week, etc., depending on your write volume).
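
As a CQL sketch of that rotation, assuming a day-granularity bucket string
computed by the client at write time (names are illustrative):

CREATE TABLE sequences_by_day (
    seq_id   text,
    day      text,   -- e.g. '2014-12-07', derived from the write timestamp
    seq_type text,
    payload  blob,
    PRIMARY KEY ((seq_id, day), seq_type)
);

-- Each day's writes land in a fresh partition, so older partitions stop
-- taking writes and settle into just a few sstables:
INSERT INTO sequences_by_day (seq_id, day, seq_type, payload)
VALUES ('seq42', '2014-12-07', 'type_x', 0xcafe);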

I do note that your original question was about preserving data locality -
and having a consistent locality for a given seq_id - for best offline
analytics.  If you want to pursue this, you can certainly also include
a blob value in your partitioning key, whose value is calculated to force a
ring collision with this record's sibling data.  With Cassandra's default
partitioner, murmur3, that's probably pretty challenging - murmur3 isn't
designed to be cryptographically strong (it makes no effort to make
collisions difficult to force), but it is meant to have good distribution
(so it may still be computationally expensive to force a collision - I'm
not that familiar with its internal workings).  ByteOrderedPartitioner
would be a lot easier to force a ring collision on, but then you need to
work out a good ring balancing strategy to distribute your data evenly over
the ring.

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan doanduy...@gmail.com wrote:

 Those sequences are not fixed. All sequences with the same seq_id tend
 to grow at the same rate. If it's one partition per seq_id, the size will
 most likely exceed the threshold quickly

 -- Then use bucketing to avoid too wide partitions

 Also new seq_types can be added and old seq_types can be deleted. This
 means I often need to ALTER TABLE to add and drop columns. I am not sure if
 this is a good practice from operation point of view.

  -- I don't understand why altering table is necessary to add seq_types.
 If seq_types is defined as your clustering column, you can have many of
 them using the same table structure ...





 On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang dep...@gmail.com wrote:

 On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens migh...@gmail.com wrote:

 It depends on the size of your data, but if your data is reasonably
 small, there should be no trouble including thousands of records on the
 same partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
 ought to work fine.

 If the data size per partition exceeds some threshold that represents
 the right tradeoff of increasing repair cost, gc pressure, threatening
 unbalanced loads, and other issues that come with wide partitions, then you
 can subpartition via some means in a manner consistent with your work load,
 with something like PRIMARY KEY ((seq_id, subpartition), seq_type).

 For example, if seq_type can be processed for a given seq_id in any
 order, and you need to be able to locate specific records for a known
 seq_id/seq_type pair, you can compute subpartition deterministically.
 Or if you only ever need to read *all* values for a given seq_id, and the
 processing order is not important, just randomly generate a value for
 subpartition at write time, as long as you know all possible values for
 subpartition.

 If the values for the seq_types for a given seq_id must always be
 processed in order based on seq_type, then your subpartition calculation
 would need to reflect that and place adjacent seq_types in the same
 partition.  As a contrived example, say seq_type was an incrementing
 integer, your subpartition could be seq_type / 100.

 On Fri Dec 05 2014 at 7:34:38 PM Kai Wang dep...@gmail.com wrote:

 I have a data model question. I am trying to figure out how to model
 the data to achieve the best data locality for analytic purpose. Our
 application processes sequences. Each sequence has a unique key in the
 format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
 number of seq_types. The typical read is to load a subset of sequences with
 the same seq_id. Naturally I would like to have all the sequences 

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jack Krupansky
It would be helpful to look at some specific examples of sequences, showing how 
they grow. I suspect that the term “sequence” is being overloaded in some 
subtly misleading way here.

Besides, we’ve already answered the headline question – data locality is 
achieved by having a common partition key. So, we need some clarity as to what 
question we are really focusing on.

And, of course, we should be asking the “Cassandra Data Modeling 101” question 
of what you want your queries to look like - how exactly you want to access 
your data. Only after we have a handle on how you need to read your data can we 
decide how it should be stored.

My immediate question to get things back on track: when you say “The typical 
read is to load a subset of sequences with the same seq_id”, what type of 
“subset” are you talking about? Again, a few explicit and concise example 
queries (in some concise, easy-to-read pseudo language or even plain English, 
but not belabored with full CQL syntax) would be very helpful. I mean, 
Cassandra has no “subset” concept, nor a “load subset” command, so what are we 
really talking about?

Also, I presume we are talking CQL, but some of the references seem more 
Thrift/slice oriented.

-- Jack Krupansky

From: Eric Stevens 
Sent: Sunday, December 7, 2014 10:12 AM
To: user@cassandra.apache.org 
Subject: Re: How to model data to achieve specific data locality

 Also new seq_types can be added and old seq_types can be deleted. This means 
 I often need to ALTER TABLE to add and drop columns. 

Kai, unless I'm misunderstanding something, I don't see why you need to alter 
the table to add a new seq type.  From a data model perspective, these are just 
new values in a row.  

If you do have columns which are specific to particular seq_types, data 
modeling does become a little more challenging.  In that case you may get some 
advantage from using collections (especially map) to store data which applies 
to only a few seq types.  Or defining a schema which includes the set of all 
possible columns (that's when you're getting into ALTERs when a new column 
comes or goes).

 All sequences with the same seq_id tend to grow at the same rate.


Note that it is an anti pattern in Cassandra to append to the same row 
indefinitely.  I think you understand this because of your original question.  
But please note that a sub partitioning strategy which reuses subpartitions 
will result in degraded read performance after a while.  You'll need to rotate 
sub partitions by something that doesn't repeat in order to keep the data for a 
given partition key grouped into just a few sstables.  A typical pattern there 
is to use some kind of time bucket (hour, day, week, etc., depending on your 
write volume).


I do note that your original question was about preserving data locality - and 
having a consistent locality for a given seq_id - for best offline analytics.  
If you wanted to work for this, you can certainly also include a blob value in 
your partitioning key, whose value is calculated to force a ring collision with 
this record's sibling data.  With Cassandra's default partitioner of murmur3, 
that's probably pretty challenging - murmur3 isn't designed to be 
cryptographically strong (it doesn't work to make it difficult to force a 
collision), but it's meant to have good distribution (it may still be 
computationally expensive to force a collision - I'm not that familiar with its 
internal workings).  In this case, ByteOrderedPartitioner would be a lot easier 
to force a ring collision on, but then you need to work on a good ring 
balancing strategy to distribute your data evenly over the ring.

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan doanduy...@gmail.com wrote:

  Those sequences are not fixed. All sequences with the same seq_id tend to 
grow at the same rate. If it's one partition per seq_id, the size will most 
likely exceed the threshold quickly 


  -- Then use bucketing to avoid too wide partitions


  Also new seq_types can be added and old seq_types can be deleted. This means 
I often need to ALTER TABLE to add and drop columns. I am not sure if this is a 
good practice from operation point of view.


  -- I don't understand why altering table is necessary to add seq_types. If 
seq_types is defined as your clustering column, you can have many of them 
using the same table structure ...









  On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang dep...@gmail.com wrote:

On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens migh...@gmail.com wrote:

  It depends on the size of your data, but if your data is reasonably 
small, there should be no trouble including thousands of records on the same 
partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type) ought to 
work fine.  


  If the data size per partition exceeds some threshold that represents the 
right tradeoff of increasing repair cost, gc pressure, threatening unbalanced 
loads, and other issues that come with wide 

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
Hi Joy,

Are you resetting your data after each test run?  I wonder if your tests
are actually causing you to fall behind on data grooming tasks such as
compaction, and so performance suffers for your later tests.

There are *so many* factors which can affect performance that, without
reviewing the test methodology in great detail, it's really hard to say
whether there are flaws which might uncover an antipattern, cause an
atypical number of cache hits or misses, and so forth. You may also be
producing gc pressure in the write path.

I *can* say that 28k writes per second looks just a little low, but it
depends a lot on your network, hardware, and write patterns (eg, data
size).  For a little performance test suite I wrote, with parallel batched
writes on a 3 node rf=3 test cluster, I got about 86k writes per
second.

Also, focusing exclusively on max latency is going to cause you some
trouble, especially on magnetic media such as you're using.  Between
ill-timed GC and the inconsistent performance characteristics of magnetic
media, your max numbers will often look significantly worse than your p(99)
or p(999) numbers.

All this said, one node will often look better than several nodes for
certain patterns because it completely eliminates proxy (coordinator) write
times.  All writes are local writes.  It's an over-simple case that doesn't
reflect any practical production use of Cassandra, so it's probably not
worth even including in your tests.  I would recommend starting at 3 nodes
rf=3, and comparing against 6 nodes rf=6.  Make sure you're staying on top
of compaction and aren't seeing garbage collections in the logs (either of
those will pollute your results with variability you can't account for
with small sample sizes of ~1 million).
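
Two stock nodetool commands cover that check (a sketch; run them on each
node while the test is underway):

nodetool compactionstats   # pending compaction tasks should stay near 0
nodetool tpstats           # watch for dropped messages and backed-up stages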

If you expect to sustain write volumes like this, you'll find these
clusters are sized too small (on that hardware you won't keep up with
compaction), and your tests are again testing scenarios you wouldn't
actually see in production.

On Sat Dec 06 2014 at 7:09:18 AM kong kongjiali...@gmail.com wrote:

 Hi,

 I am doing a stress test on Datastax Cassandra Community 2.1.2, not using
 the provided stress test tool but my own stress-test client code
 (I wrote some C++ stress test code). My Cassandra cluster is
 deployed on Amazon EC2, using the Datastax Community AMI (HVM
 instances) from the Datastax documentation, and I am not using EBS, just
 the ephemeral storage by default. The EC2 type of the Cassandra servers is
 m3.xlarge. I use another EC2 instance for my stress test client, which is
 of type r3.8xlarge. Both the Cassandra server nodes and the stress test
 client node are in us-east. I test Cassandra clusters made up of 1 node, 2
 nodes, and 4 nodes separately. I run the INSERT test and the SELECT test
 separately, but performance does not increase linearly when new nodes are
 added. I also get some weird results. My test results are as follows (I do
 1 million operations and try to get the best QPS while keeping max latency
 under 200ms; latencies are measured from the client side, and QPS is
 calculated as total_operations/total_time):



 *INSERT(write):*

 Node count   Replication factor   QPS     Avg(ms)   Min(ms)   .95(ms)   .99(ms)   .999(ms)   Max(ms)
 1            1                    18687   2.08      1.48      2.95      5.74      52.8       205.4
 2            1                    20793   3.15      0.84      7.71      41.35     88.7       232.7
 2            2                    22498   3.37      0.86      6.04      36.1      221.5      649.3
 4            1                    28348   4.38      0.85      8.19      64.51     169.4      251.9
 4            3                    28631   5.22      0.87      18.68     68.35     167.2      288



 *SELECT(read):*

 Node count   Replication factor   QPS     Avg(ms)   Min(ms)   .95(ms)   .99(ms)   .999(ms)   Max(ms)
 1            1                    24498   4.01      1.51      7.6       12.51     31.5       129.6
 2            1                    28219   3.38      0.85      9.5       17.71     39.2       152.2
 2            2                    35383   4.06      0.87      9.71      21.25     70.3       215.9
 4            1                    34648   2.78      0.86      6.07      14.94     30.8       134.6
 4            3                    52932   3.45      0.86      10.81     21.05     37.4       189.1



 The test data I use is generated randomly, and the schema I use is like (I
 use the cqlsh to create the columnfamily/table):

 CREATE TABLE table(
     id1  varchar,
     ts   varchar,
     id2  varchar,
     msg  varchar,
     PRIMARY KEY(id1, ts, id2));

 So the fields are all strings, and I generate each character of the
 strings randomly, using srand(time(0)) and rand() in C++, so I think my
 test data should be uniformly distributed across the Cassandra cluster.
 And, in my client stress test code, I use the thrift C++ interface, and
 the basic operations I do are like:

 thrift_client.execute_cql3_query(“INSERT INTO table (id1, ts, id2, msg)
 VALUES ('xxx', 'xxx', 'xxx', 'xxx')”); and
 thrift_client.execute_cql3_query(“SELECT * FROM table WHERE id1='xxx'”);

 Each data entry I INSERT or SELECT is around 100 characters.

 On my stress test client, I create several threads to send 

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
I'm sorry, I meant to say 6 nodes rf=3.

Also look at performance over sustained periods of time, not burst
writing.  Run your test for several hours and watch memory and especially
compaction stats.  See what data volume you can write while keeping
outstanding compaction tasks < 5 (preferably 0 or 1) for
sustained periods.  Measuring just burst writes will definitely mask real
world conditions, and Cassandra actually absorbs bursted writes really well
(which in turn masks performance problems, since by the time your write
times suffer from overwhelming a cluster, you're probably already in an
insane and difficult-to-recover crisis mode).

On Sun Dec 07 2014 at 8:55:47 AM Eric Stevens migh...@gmail.com wrote:

 Hi Joy,

 Are you resetting your data after each test run?  I wonder if your tests
 are actually causing you to fall behind on data grooming tasks such as
 compaction, and so performance suffers for your later tests.

 There are *so many* factors which can affect performance, without
 reviewing test methodology in great detail, it's really hard to say whether
 there are flaws which might uncover an antipattern cause atypical number of
 cache hits or misses, and so forth. You may also be producing gc pressure
 in the write path, and so forth.

 I *can* say that 28k writes per second looks just a little low, but it
 depends a lot on your network, hardware, and write patterns (eg, data
 size).  For a little performance test suite I wrote, with parallel batched
 writes, on a 3 node rf=3 cluster test cluster, I got about 86k writes per
 second.

 Also focusing exclusively on max latency is going to cause you some
 troubles especially in the case of magnetic media as you're using.  Between
 ill-timed GC and inconsistent performance characteristics from magnetic
 media, your max numbers will often look significantly worse than your p(99)
 or p(999) numbers.

 All this said, one node will often look better than several nodes for
 certain patterns because it completely eliminates proxy (coordinator) write
 times.  All writes are local writes.  It's an over-simple case that doesn't
 reflect any practical production use of Cassandra, so it's probably not
 worth even including in your tests.  I would recommend start at 3 nodes
 rf=3, and compare against 6 nodes rf=6.  Make sure you're staying on top of
 compaction and aren't seeing garbage collections in the logs (either of
 those will be polluting your results with variability you can't account for
 with small sample sizes of ~1 million).

 If you expect to sustain write volumes like this, you'll find these
 clusters are sized too small (on that hardware you won't keep up with
 compaction), and your tests are again testing scenarios you wouldn't
 actually see in production.

 On Sat Dec 06 2014 at 7:09:18 AM kong kongjiali...@gmail.com wrote:

 Hi,

 I am doing stress test on Datastax Cassandra Community 2.1.2, not using
 the provided stress test tool, but use my own stress-test client code
 instead(I write some C++ stress test code). My Cassandra cluster is
 deployed on Amazon EC2, using the provided Datastax Community AMI( HVM
 instances ) in the Datastax document, and I am not using EBS, just using
 the ephemeral storage by default. The EC2 type of Cassandra servers are
 m3.xlarge. I use another EC2 instance for my stress test client, which is
 of type r3.8xlarge. Both the Cassandra sever nodes and stress test client
 node are in us-east. I test the Cassandra cluster which is made up of 1
 node, 2 nodes, and 4 nodes separately. I just do INSERT test and SELECT
 test separately, but the performance doesn’t get linear increment when new
 nodes are added. Also I get some weird results. My test results are as
 follows(*I do 1 million operations and I try to get the best QPS when
 the max latency is no more than 200ms, and the latencies are measured from
 the client side. The QPS is calculated by total_operations/total_time).*



 *INSERT(write):*

 Node count   Replication factor   QPS     Avg(ms)   Min(ms)   .95(ms)   .99(ms)   .999(ms)   Max(ms)
 1            1                    18687   2.08      1.48      2.95      5.74      52.8       205.4
 2            1                    20793   3.15      0.84      7.71      41.35     88.7       232.7
 2            2                    22498   3.37      0.86      6.04      36.1      221.5      649.3
 4            1                    28348   4.38      0.85      8.19      64.51     169.4      251.9
 4            3                    28631   5.22      0.87      18.68     68.35     167.2      288



 *SELECT(read):*

 Node count   Replication factor   QPS     Avg(ms)   Min(ms)   .95(ms)   .99(ms)   .999(ms)   Max(ms)
 1            1                    24498   4.01      1.51      7.6       12.51     31.5       129.6
 2            1                    28219   3.38      0.85      9.5       17.71     39.2       152.2
 2            2                    35383   4.06      0.87      9.71      21.25     70.3       215.9
 4            1                    34648   2.78      0.86      6.07      14.94     30.8       134.6
 4            3                    52932   3.45      0.86      10.81     21.05     37.4       189.1



 The test data I use is generated randomly, and the schema I use is like
 (I use 

Re: full gc too often

2014-12-07 Thread Philo Yang
2014-12-05 15:40 GMT+08:00 Jonathan Haddad j...@jonhaddad.com:

 I recommend reading through
 https://issues.apache.org/jira/browse/CASSANDRA-8150 to get an idea of
 how the JVM GC works and what you can do to tune it.  Also good is Blake
 Eggleston's writeup which can be found here:
 http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html

 I'd like to note that allocating 4GB heap to Cassandra under any serious
 workload is unlikely to be sufficient.


Thanks for your recommendation. After reading those, I tried allocating a
larger heap, and it helps; a 4G heap indeed can't handle the workload in my
use case.
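
For reference, the heap is overridden in conf/cassandra-env.sh; a minimal
sketch (the sizes below are illustrative, not a recommendation):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"   # the file's own comments suggest ~100 MB per physical CPU core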

So another question is: how much pressure can the default max heap (8G)
handle? The pressure is not just a simple QPS number - a slice query over
many columns in a row will allocate more objects on the heap than a query
for a single column. Are there any test results on the relationship between
load and a safe heap size? We know that querying a slice with many
tombstones is a bad use case, but querying a slice without tombstones
should be a common use case, right?




 On Thu Dec 04 2014 at 8:43:38 PM Philo Yang ud1...@gmail.com wrote:

 I have two kinds of machine:
 16G RAM, with default heap size setting, about 4G.
 64G RAM, with default heap size setting, about 8G.

 These two kinds of nodes have same number of vnodes, and both of them
 have gc issue, although the node of 16G have a higher probability  of gc
 issue.

 Thanks,
 Philo Yang


 2014-12-05 12:34 GMT+08:00 Tim Heckman t...@pagerduty.com:

 On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote:
 
  Hi,all
 
  I have a cluster on C* 2.1.1 and jdk 1.7_u51. I have a trouble with
 full gc that sometime there may be one or two nodes full gc more than one
 time per minute and over 10 seconds each time, then the node will be
 unreachable and the latency of cluster will be increased.
 
  I grep the GCInspector's log, I found when the node is running fine
 without gc trouble there are two kinds of gc:
  ParNew GC in less than 300ms which clear the Par Eden Space and
 enlarge CMS Old Gen/ Par Survivor Space little (because it only show gc in
 more than 200ms, there is only a small number of ParNew GC in log)
  ConcurrentMarkSweep in 4000~8000ms which reduce CMS Old Gen much and
 enlarge Par Eden Space little, each 1-2 hours it will be executed once.
 
  However, sometimes ConcurrentMarkSweep will be strange like it shows:
 
  INFO  [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms.  CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
  INFO  [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms.  CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms.  CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms.  CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms.  CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms.  CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
  INFO  [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms.  CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
  INFO  [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms.  CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
  INFO  [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms.  CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
  INFO  [Service Thread] 2014-12-05 11:30:54,085 GCInspector.java:142 - ConcurrentMarkSweep GC in 12082ms.  CMS Old Gen: 3579786592 -> 3579814464; Par Eden Space: 503316480 -> 448782440; Par Survivor Space: 62914560 -> 0
 
  Each time, Old Gen shrinks only a little and Survivor Space is cleared,
 but the heap is still full, so another full gc follows very soon and then
 the node goes down. If I restart the node, it runs fine without gc
 trouble.
 
  Can anyone help me find out why full gc can't
 reduce CMS Old Gen? Is 

Re: Could ring cache really improve performance in Cassandra?

2014-12-07 Thread Jonathan Haddad
What's a ring cache?

FYI, if you're using the DataStax CQL drivers, they will automatically
route requests to the correct node.
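
For example, with the DataStax Java driver of that era (2.x), token-aware
routing is just a load balancing policy; a minimal sketch, where the
contact point and query are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareExample {
    public static void main(String[] args) {
        // TokenAwarePolicy hashes the partition key client-side and sends
        // each request straight to a replica -- the same effect the "ring
        // cache" is after, without maintaining it yourself.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                .build();
        Session session = cluster.connect();
        session.execute("SELECT release_version FROM system.local");
        cluster.close();
    }
}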

On Sun Dec 07 2014 at 12:59:36 AM kong kongjiali...@gmail.com wrote:

 Hi,

 I'm doing stress test on Cassandra. And I learn that using ring cache can
 improve the performance because the client requests can directly go to the
 target Cassandra server and the coordinator Cassandra node is the desired
 target node. In this way, there is no need for coordinator node to route
 the client requests to the target node, and maybe we can get the linear
 performance increment.



 However, in my stress test on an Amazon EC2 cluster, the test results are
 weird. Seems that there's no performance improvement after using ring
 cache. Could anyone help me explain this results? (Also, I think the
 results of test without ring cache is weird, because there's no linear
 increment on QPS when new nodes are added. I need help on explaining this,
 too). The results are as follows:



 INSERT(write):

 Node count   Replication factor   QPS (no ring cache)   QPS (ring cache)
 1            1                    18687                 20195
 2            1                    20793                 26403
 2            2                    22498                 21263
 4            1                    28348                 30010
 4            3                    28631                 24413



 SELECT(read):

 Node count   Replication factor   QPS (no ring cache)   QPS (ring cache)
 1            1                    24498                 22802
 2            1                    28219                 27030
 2            2                    35383                 36674
 4            1                    34648                 28347
 4            3                    52932                 52590





 Thank you very much,

 Joy



Re: full gc too often

2014-12-07 Thread Jonathan Haddad
There are a lot of factors that go into tuning, and I don't know of any
reliable formula you can use to figure out what's going to work
optimally for your hardware.  Personally I recommend:

1) find the bottleneck
2) play with a parameter (or two)
3) see what changed, performance-wise

If you've got a specific question I think someone can find a way to help,
but asking what can 8GB of heap give me is pretty abstract and
unanswerable.

Jon

On Sun Dec 07 2014 at 8:03:53 AM Philo Yang ud1...@gmail.com wrote:

 2014-12-05 15:40 GMT+08:00 Jonathan Haddad j...@jonhaddad.com:

 I recommend reading through https://issues.apache.
 org/jira/browse/CASSANDRA-8150 to get an idea of how the JVM GC works
 and what you can do to tune it.  Also good is Blake Eggleston's writeup
 which can be found here: http://blakeeggleston.com/
 cassandra-tuning-the-jvm-for-read-heavy-workloads.html

 I'd like to note that allocating 4GB heap to Cassandra under any serious
 workload is unlikely to be sufficient.


 Thanks for your recommendation. After reading I try to allocate a larger
 heap and it is useful for me. 4G heap can't handle the workload in my use
 case indeed.

 So another question is, how much pressure dose default max heap (8G) can
 handle? The pressure may not be a simple qps, you know, slice query for
 many columns in a row will allocate more objects in heap than the query for
 a single column. Is there any testing result for the relationship between
 the pressure and the safety heap size? We know query a slice with many
 tombstones is not a good use case, but query a slice without tombstones may
 be a common use case, right?




 On Thu Dec 04 2014 at 8:43:38 PM Philo Yang ud1...@gmail.com wrote:

 I have two kinds of machine:
 16G RAM, with default heap size setting, about 4G.
 64G RAM, with default heap size setting, about 8G.

 These two kinds of nodes have same number of vnodes, and both of them
 have gc issue, although the node of 16G have a higher probability  of gc
 issue.

 Thanks,
 Philo Yang


 2014-12-05 12:34 GMT+08:00 Tim Heckman t...@pagerduty.com:

 On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote:
 
  Hi,all
 
  I have a cluster on C* 2.1.1 and jdk 1.7_u51. I have a trouble with
 full gc that sometime there may be one or two nodes full gc more than one
 time per minute and over 10 seconds each time, then the node will be
 unreachable and the latency of cluster will be increased.
 
  I grep the GCInspector's log, I found when the node is running fine
 without gc trouble there are two kinds of gc:
  ParNew GC in less than 300ms which clear the Par Eden Space and
 enlarge CMS Old Gen/ Par Survivor Space little (because it only show gc in
 more than 200ms, there is only a small number of ParNew GC in log)
  ConcurrentMarkSweep in 4000~8000ms which reduce CMS Old Gen much and
 enlarge Par Eden Space little, each 1-2 hours it will be executed once.
 
  However, sometimes ConcurrentMarkSweep will be strange like it shows:
 
  INFO  [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms.  CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
  INFO  [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms.  CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms.  CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms.  CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms.  CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
  INFO  [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms.  CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
  INFO  [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms.  CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
  INFO  [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms.  CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
  INFO  [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms.  CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
  INFO  [Service 

Re: Recommissioned node is much smaller

2014-12-07 Thread Y.Wong
On Dec 2, 2014 3:45 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Dec 2, 2014 at 12:21 PM, Robert Wille rwi...@fold3.com wrote:

 As a test, I took down a node, deleted /var/lib/cassandra and restarted
 it. After it joined the cluster, it’s about 75% the size of its neighbors
 (both in terms of bytes and number of keys). Prior to my test it was
 approximately the same size. I have no explanation for why that node would
 shrink so much, other than data loss. I have deleted no data, and have no
 TTLs. Only a small percentage of my data has had any updates (and some of
 my tables have had only inserts, and those have shrunk by 25% as well). I
 don’t really know how to check whether I have records that have fewer than
 three replicas (RF=3).


 Sounds suspicious, actually. I would suspect partial-bootstrap.

 To determine if you have under-replicated data, run repair. That's what
 it's for.

 =Rob




Re: How to model data to achieve specific data locality

2014-12-07 Thread Kai Wang
Thanks for the help. I wasn't clear on how clustering columns work. Coming
from a Thrift background, it took me a while to understand how clustering
columns impact partition storage on disk. Now I believe using seq_type as
the first clustering column solves my problem. As for partition size, I
will start with an assumed bucket size. If the partition size exceeds the
threshold I may need to re-bucket using a smaller bucket size.

On another thread Eric mentions the optimal partition size should be around
100 kb ~ 1 MB. I will use that as the starting point to design my bucket
strategy.


On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky j...@basetechnology.com
wrote:

   It would be helpful to look at some specific examples of sequences,
 showing how they grow. I suspect that the term “sequence” is being
 overloaded in some subtly misleading way here.

 Besides, we’ve already answered the headline question – data locality is
 achieved by having a common partition key. So, we need some clarity as to
 what question we are really focusing on

 And, of course, we should be asking the “Cassandra Data Modeling 101”
 question of what do your queries want to look like, how exactly do you want
 to access your data. Only after we have a handle on how you need to read
 your data can we decide how it should be stored.

 My immediate question to get things back on track: When you say “The
 typical read is to load a subset of sequences with the same seq_id”, what
 type of “subset” are you talking about? Again, a few explicit and concise
 example queries (in some concise, easy to read pseudo language or even
 plain English, but not belabored with full CQL syntax.) would be very
 helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
 command, so what are we really talking about?

 Also, I presume we are talking CQL, but some of the references seem more
 Thrift/slice oriented.

 -- Jack Krupansky

  *From:* Eric Stevens migh...@gmail.com
 *Sent:* Sunday, December 7, 2014 10:12 AM
 *To:* user@cassandra.apache.org
 *Subject:* Re: How to model data to achieve specific data locality

  Also new seq_types can be added and old seq_types can be deleted. This
 means I often need to ALTER TABLE to add and drop columns.

 Kai, unless I'm misunderstanding something, I don't see why you need to
 alter the table to add a new seq type.  From a data model perspective,
 these are just new values in a row.

 If you do have columns which are specific to particular seq_types, data
 modeling does become a little more challenging.  In that case you may get
 some advantage from using collections (especially map) to store data which
 applies to only a few seq types.  Or defining a schema which includes the
 set of all possible columns (that's when you're getting into ALTERs when a
 new column comes or goes).

  All sequences with the same seq_id tend to grow at the same rate.

 Note that it is an anti pattern in Cassandra to append to the same row
 indefinitely.  I think you understand this because of your original
 question.  But please note that a sub partitioning strategy which reuses
 subpartitions will result in degraded read performance after a while.
 You'll need to rotate sub partitions by something that doesn't repeat in
 order to keep the data for a given partition key grouped into just a few
 sstables.  A typical pattern there is to use some kind of time bucket
 (hour, day, week, etc., depending on your write volume).

 I do note that your original question was about preserving data locality -
 and having a consistent locality for a given seq_id - for best offline
 analytics.  If you wanted to work for this, you can certainly also include
 a blob value in your partitioning key, whose value is calculated to force a
 ring collision with this record's sibling data.  With Cassandra's default
 partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
 designed to be cryptographically strong (it doesn't work to make it
 difficult to force a collision), but it's meant to have good distribution
 (it may still be computationally expensive to force a collision - I'm not
 that familiar with its internal workings).  In this case,
 ByteOrderedPartitioner would be a lot easier to force a ring collision on,
 but then you need to work on a good ring balancing strategy to distribute
 your data evenly over the ring.

 On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan doanduy...@gmail.com wrote:

 Those sequences are not fixed. All sequences with the same seq_id tend
 to grow at the same rate. If it's one partition per seq_id, the size will
 most likely exceed the threshold quickly

  -- Then use bucketing to avoid too wide partitions

 Also new seq_types can be added and old seq_types can be deleted. This
 means I often need to ALTER TABLE to add and drop columns. I am not sure if
 this is a good practice from operation point of view.

  -- I don't understand why altering table is necessary to add
 seq_types. If seq_types is defined as 

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jack Krupansky
As a general rule, partitions can certainly be much larger than 1 MB, even up 
to 100 MB. 5 MB to 10 MB might be a good target size.

Originally you stated that the number of seq_types could be “unlimited”... is 
that really true? Is there no practical upper limit you can establish, like 
10,000 or 10 million or...? Sure, buckets are a very real option, but if the 
number of seq_types was only 10,000 to 50,000, then bucketing might be 
unnecessary complexity and access overhead.

-- Jack Krupansky

From: Kai Wang 
Sent: Sunday, December 7, 2014 3:06 PM
To: user@cassandra.apache.org 
Subject: Re: How to model data to achieve specific data locality

Thanks for the help. I wasn't clear how clustering column works. Coming from 
Thrift experience, it took me a while to understand how clustering column 
impacts partition storage on disk. Now I believe using seq_type as the first 
clustering column solves my problem. As of partition size, I will start with 
some bucket assumption. If the partition size exceeds the threshold I may need 
to re-bucket using smaller bucket size.


On another thread Eric mentions the optimal partition size should be at 100 kb 
~ 1 MB. I will use that as the start point to design my bucket strategy.



On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky j...@basetechnology.com wrote:

  It would be helpful to look at some specific examples of sequences, showing 
how they grow. I suspect that the term “sequence” is being overloaded in some 
subtly misleading way here.

  Besides, we’ve already answered the headline question – data locality is 
achieved by having a common partition key. So, we need some clarity as to what 
question we are really focusing on

  And, of course, we should be asking the “Cassandra Data Modeling 101” 
question of what do your queries want to look like, how exactly do you want to 
access your data. Only after we have a handle on how you need to read your data 
can we decide how it should be stored.

  My immediate question to get things back on track: When you say “The typical 
read is to load a subset of sequences with the same seq_id”, what type of 
“subset” are you talking about? Again, a few explicit and concise example 
queries (in some concise, easy to read pseudo language or even plain English, 
but not belabored with full CQL syntax.) would be very helpful. I mean, 
Cassandra has no “subset” concept, nor a “load subset” command, so what are we 
really talking about?

  Also, I presume we are talking CQL, but some of the references seem more 
Thrift/slice oriented.

  -- Jack Krupansky

  From: Eric Stevens 
  Sent: Sunday, December 7, 2014 10:12 AM
  To: user@cassandra.apache.org 
  Subject: Re: How to model data to achieve specific data locality

   Also new seq_types can be added and old seq_types can be deleted. This 
means I often need to ALTER TABLE to add and drop columns. 

  Kai, unless I'm misunderstanding something, I don't see why you need to alter 
the table to add a new seq type.  From a data model perspective, these are just 
new values in a row.  

  If you do have columns which are specific to particular seq_types, data 
modeling does become a little more challenging.  In that case you may get some 
advantage from using collections (especially map) to store data which applies 
to only a few seq types.  Or defining a schema which includes the set of all 
possible columns (that's when you're getting into ALTERs when a new column 
comes or goes).

   All sequences with the same seq_id tend to grow at the same rate.


  Note that it is an anti pattern in Cassandra to append to the same row 
indefinitely.  I think you understand this because of your original question.  
But please note that a sub partitioning strategy which reuses subpartitions 
will result in degraded read performance after a while.  You'll need to rotate 
sub partitions by something that doesn't repeat in order to keep the data for a 
given partition key grouped into just a few sstables.  A typical pattern there 
is to use some kind of time bucket (hour, day, week, etc., depending on your 
write volume).


  I do note that your original question was about preserving data locality - 
and having a consistent locality for a given seq_id - for best offline 
analytics.  If you wanted to work for this, you can certainly also include a 
blob value in your partitioning key, whose value is calculated to force a ring 
collision with this record's sibling data.  With Cassandra's default 
partitioner of murmur3, that's probably pretty challenging - murmur3 isn't 
designed to be cryptographically strong (it doesn't work to make it difficult 
to force a collision), but it's meant to have good distribution (it may still 
be computationally expensive to force a collision - I'm not that familiar with 
its internal workings).  In this case, ByteOrderedPartitioner would be a lot 
easier to force a ring collision on, but then you need to work on a good ring 
balancing strategy to distribute your 

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jonathan Haddad
I think he mentioned 100MB as the max size - planning for 1MB might make
your data model difficult to work with.

On Sun Dec 07 2014 at 12:07:47 PM Kai Wang dep...@gmail.com wrote:

 Thanks for the help. I wasn't clear how clustering column works. Coming
 from Thrift experience, it took me a while to understand how clustering
 column impacts partition storage on disk. Now I believe using seq_type as
 the first clustering column solves my problem. As of partition size, I will
 start with some bucket assumption. If the partition size exceeds the
 threshold I may need to re-bucket using smaller bucket size.

 On another thread Eric mentions the optimal partition size should be at
 100 kb ~ 1 MB. I will use that as the start point to design my bucket
 strategy.


 On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky j...@basetechnology.com
 wrote:

   It would be helpful to look at some specific examples of sequences,
 showing how they grow. I suspect that the term “sequence” is being
 overloaded in some subtly misleading way here.

 Besides, we’ve already answered the headline question – data locality is
 achieved by having a common partition key. So, we need some clarity as to
 what question we are really focusing on

 And, of course, we should be asking the “Cassandra Data Modeling 101”
 question of what do your queries want to look like, how exactly do you want
 to access your data. Only after we have a handle on how you need to read
 your data can we decide how it should be stored.

 My immediate question to get things back on track: When you say “The
 typical read is to load a subset of sequences with the same seq_id”,
 what type of “subset” are you talking about? Again, a few explicit and
 concise example queries (in some concise, easy to read pseudo language or
 even plain English, but not belabored with full CQL syntax.) would be very
 helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
 command, so what are we really talking about?

 Also, I presume we are talking CQL, but some of the references seem more
 Thrift/slice oriented.

 -- Jack Krupansky

  *From:* Eric Stevens migh...@gmail.com
 *Sent:* Sunday, December 7, 2014 10:12 AM
 *To:* user@cassandra.apache.org
 *Subject:* Re: How to model data to achieve specific data locality

  Also new seq_types can be added and old seq_types can be deleted. This
 means I often need to ALTER TABLE to add and drop columns.

 Kai, unless I'm misunderstanding something, I don't see why you need to
 alter the table to add a new seq type.  From a data model perspective,
 these are just new values in a row.

 If you do have columns which are specific to particular seq_types, data
 modeling does become a little more challenging.  In that case you may get
 some advantage from using collections (especially map) to store data which
 applies to only a few seq types.  Or defining a schema which includes the
 set of all possible columns (that's when you're getting into ALTERs when a
 new column comes or goes).

  All sequences with the same seq_id tend to grow at the same rate.

 Note that it is an anti pattern in Cassandra to append to the same row
 indefinitely.  I think you understand this because of your original
 question.  But please note that a sub partitioning strategy which reuses
 subpartitions will result in degraded read performance after a while.
 You'll need to rotate sub partitions by something that doesn't repeat in
 order to keep the data for a given partition key grouped into just a few
 sstables.  A typical pattern there is to use some kind of time bucket
 (hour, day, week, etc., depending on your write volume).

 I do note that your original question was about preserving data locality
 - and having a consistent locality for a given seq_id - for best offline
 analytics.  If you wanted to work for this, you can certainly also include
 a blob value in your partitioning key, whose value is calculated to force a
 ring collision with this record's sibling data.  With Cassandra's default
 partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
 designed to be cryptographically strong (it doesn't work to make it
 difficult to force a collision), but it's meant to have good distribution
 (it may still be computationally expensive to force a collision - I'm not
 that familiar with its internal workings).  In this case,
 ByteOrderedPartitioner would be a lot easier to force a ring collision on,
 but then you need to work on a good ring balancing strategy to distribute
 your data evenly over the ring.

 On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan doanduy...@gmail.com
 wrote:

 Those sequences are not fixed. All sequences with the same seq_id tend
 to grow at the same rate. If it's one partition per seq_id, the size will
 most likely exceed the threshold quickly

  -- Then use bucketing to avoid too wide partitions

 Also new seq_types can be added and old seq_types can be deleted. This
 means I often need to ALTER TABLE to add and 

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread 孔嘉林
Hi Eric,
Thank you very much for your reply!
Do you mean that I should clear my table after each run? Indeed, I can see
compaction happen several times during my test, but could just a few
compactions affect performance that much? Also, in OpsCenter I can see some
ParNew GCs happen but no CMS GCs.

I run my test on an EC2 cluster, so the network within it should be fast.
Each Cassandra server has 4 vCPUs, 15 GiB of memory and 80 GB of SSD
storage, which is the m3.xlarge type.

As for latency, which latency should I care about most, p(99) or p(999)? I
want to get the max QPS under a given latency limit.

I know my testing scenarios are not the common case in production; I just
want to know how much load my cluster can bear under stress.

So, how did you test your cluster to get 86k writes/sec? How many requests
did you send to your cluster? Was it also 1 million? Did you also use
OpsCenter to monitor real-time performance? I also wonder why the write and
read QPS that OpsCenter reports are much lower than what I calculate. Could
you please describe your test deployment in detail?

Thank you very much,
Joy

2014-12-07 23:55 GMT+08:00 Eric Stevens migh...@gmail.com:

 Hi Joy,

 Are you resetting your data after each test run?  I wonder if your tests
 are actually causing you to fall behind on data grooming tasks such as
 compaction, and so performance suffers for your later tests.

 There are *so many* factors which can affect performance, without
 reviewing test methodology in great detail, it's really hard to say whether
 there are flaws which might uncover an antipattern cause atypical number of
 cache hits or misses, and so forth. You may also be producing gc pressure
 in the write path, and so forth.

 I *can* say that 28k writes per second looks just a little low, but it
 depends a lot on your network, hardware, and write patterns (eg, data
 size).  For a little performance test suite I wrote, with parallel batched
 writes, on a 3 node rf=3 cluster test cluster, I got about 86k writes per
 second.

 Also focusing exclusively on max latency is going to cause you some
 troubles especially in the case of magnetic media as you're using.  Between
 ill-timed GC and inconsistent performance characteristics from magnetic
 media, your max numbers will often look significantly worse than your p(99)
 or p(999) numbers.

 All this said, one node will often look better than several nodes for
 certain patterns because it completely eliminates proxy (coordinator) write
 times.  All writes are local writes.  It's an over-simple case that doesn't
 reflect any practical production use of Cassandra, so it's probably not
 worth even including in your tests.  I would recommend starting at 3 nodes
 rf=3, and compare against 6 nodes rf=6.  Make sure you're staying on top of
 compaction and aren't seeing garbage collections in the logs (either of
 those will be polluting your results with variability you can't account for
 with small sample sizes of ~1 million).

 If you expect to sustain write volumes like this, you'll find these
 clusters are sized too small (on that hardware you won't keep up with
 compaction), and your tests are again testing scenarios you wouldn't
 actually see in production.

 On Sat Dec 06 2014 at 7:09:18 AM kong kongjiali...@gmail.com wrote:

 Hi,

 I am doing a stress test on Datastax Cassandra Community 2.1.2, not using
 the provided stress test tool but my own stress-test client code instead (I
 wrote some C++ stress test code). My Cassandra cluster is deployed on
 Amazon EC2, using the Datastax Community AMI (HVM instances) from the
 Datastax documentation, and I am not using EBS, just the ephemeral storage
 by default. The EC2 instance type of the Cassandra servers is m3.xlarge. I
 use another EC2 instance, of type r3.8xlarge, for my stress test client.
 Both the Cassandra server nodes and the stress test client node are in
 us-east. I test clusters of 1 node, 2 nodes, and 4 nodes separately,
 running the INSERT test and the SELECT test separately, but performance
 does not increase linearly when new nodes are added. I also get some weird
 results. My test results are as follows (I do 1 million operations and try
 to get the best QPS while the max latency stays under 200 ms; latencies are
 measured from the client side, and QPS is calculated as
 total_operations/total_time):



 INSERT(write):  (latencies in ms)

 Node count | Replication factor | QPS   | Avg  | Min  | .95   | .99   | .999  | Max
 1          | 1                  | 18687 | 2.08 | 1.48 | 2.95  | 5.74  | 52.8  | 205.4
 2          | 1                  | 20793 | 3.15 | 0.84 | 7.71  | 41.35 | 88.7  | 232.7
 2          | 2                  | 22498 | 3.37 | 0.86 | 6.04  | 36.1  | 221.5 | 649.3
 4          | 1                  | 28348 | 4.38 | 0.85 | 8.19  | 64.51 | 169.4 | 251.9
 4          | 3                  | 28631 | 5.22 | 0.87 | 18.68 | 68.35 | 167.2 | 288



 SELECT(read):

 Node count | Replication factor | QPS | Average 

Re: Could ring cache really improve performance in Cassandra?

2014-12-07 Thread 孔嘉林
I find under the src/client folder of Cassandra 2.1.0 source code, there is
a *RingCache.java* file. It uses a thrift client calling the
*describe_ring()* API to get the token range of each Cassandra node. It is
used on the client side. The client can use it combined with the
partitioner to get the target node. In this way there is no need to route
requests between Cassandra nodes, and the client can directly connect to
the target node. So maybe it can save some routing time and improve
performance.
Thank you very much.

2014-12-08 1:28 GMT+08:00 Jonathan Haddad j...@jonhaddad.com:

 What's a ring cache?

 FYI if you're using the DataStax CQL drivers they will automatically
 route requests to the correct node.
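 To illustrate that point, here is a minimal sketch of enabling token-aware
 routing in the DataStax Python driver (2.x-era API); the contact point is
 illustrative:

 from cassandra.cluster import Cluster
 from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

 # TokenAwarePolicy wraps a child policy so each query is sent first to a
 # replica that owns the data, skipping the extra coordinator hop. Routing
 # works best with prepared statements, where the routing key is known.
 cluster = Cluster(
     ['127.0.0.1'],  # illustrative contact point
     load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))
 session = cluster.connect()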

 On Sun Dec 07 2014 at 12:59:36 AM kong kongjiali...@gmail.com wrote:





Can not connect with cqlsh to something different than localhost

2014-12-07 Thread Richard Snowden
I am running Cassandra 2.1.2 in an Ubuntu VM.

cqlsh or cqlsh localhost works fine.

But I cannot connect from outside the VM (firewall, etc. disabled).

Even when I run cqlsh 192.168.111.136 inside the VM I get connection
refused. This is strange, because when I check my network config I can see
that 192.168.111.136 is my IP:

root@ubuntu:~# ifconfig

eth0  Link encap:Ethernet  HWaddr 00:0c:29:02:e0:de
  inet addr:192.168.111.136  Bcast:192.168.111.255
Mask:255.255.255.0
  inet6 addr: fe80::20c:29ff:fe02:e0de/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:16042 errors:0 dropped:0 overruns:0 frame:0
  TX packets:8638 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:21307125 (21.3 MB)  TX bytes:709471 (709.4 KB)

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:65536  Metric:1
  RX packets:550 errors:0 dropped:0 overruns:0 frame:0
  TX packets:550 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:148053 (148.0 KB)  TX bytes:148053 (148.0 KB)


root@ubuntu:~# cqlsh 192.168.111.136 9042
Connection error: ('Unable to connect to any servers', {'192.168.111.136':
error(111, "Tried connecting to [('192.168.111.136', 9042)]. Last error:
Connection refused")})


What to do?


Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Chris Lohfink
I think your client could use improvements.  How many threads do you have
running in your test?  With a thrift call like that you can only do one
request at a time per connection.  For example, assuming C* takes 0 ms, a
10 ms network latency/driver overhead will mean a 20 ms RTT and a max
throughput of ~50 QPS per thread (the native binary protocol doesn't behave
like this).  Are you running the client on its own system or sharing it
with a node?  How are you load balancing your requests?  Source code would
help, since there's a lot that can become a bottleneck.
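To make the arithmetic concrete: at a 20 ms round trip, a blocking thrift client tops out near 1000 ms / 20 ms = 50 requests per second per thread. A minimal sketch of the alternative - pipelining requests over the native protocol with the Python driver; table and keyspace names are illustrative:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_ks')  # illustrative
select = session.prepare("SELECT val FROM t WHERE id = ?")  # hypothetical table

# The native protocol multiplexes many requests per connection, so one
# thread keeps a whole window in flight instead of paying a full RTT each.
futures = [session.execute_async(select, (i,)) for i in range(100)]
rows = [f.result() for f in futures]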

Generally you will see a bit of a dip in latency from N=RF=1 to N=2, RF=2,
etc., since there are optimizations on the coordinator node when it doesn't
need to send the request to the replicas.  The impact of the network
overhead decreases in significance as the cluster grows.  Typically,
latency-wise, RF=N=1 is going to be the fastest possible for smaller loads
(i.e., when a client cannot fully saturate a single node).

The main thing to expect is that latency will plateau and remain fairly
constant as load/nodes increase, while throughput potential will increase
linearly (empirically, at least).

You should really attempt it with the native binary protocol plus prepared
statements; running CQL over thrift is far from optimal.  I would recommend
using the cassandra-stress tool if you want to stress test Cassandra (and
not your code):
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
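For example, an invocation of the 2.1 tool might look like the following; exact options vary by version (see the linked post), and the node address is illustrative:

$ cassandra-stress write n=1000000 cl=ONE -rate threads=50 -node 192.168.1.10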

===
Chris Lohfink

On Sun, Dec 7, 2014 at 9:48 PM, 孔嘉林 kongjiali...@gmail.com wrote:


Re: Can not connect with cqlsh to something different than localhost

2014-12-07 Thread Michael Dykman
Try:
$ netstat -lnt
and see which interface port 9042 is listening on. You will likely need to
update cassandra.yaml to change the interface. By default, Cassandra listens
on localhost only, which is why your local cqlsh session works.
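For example, in cassandra.yaml (the address below is the one from the question; in 2.1, binding the wildcard address also requires broadcast_rpc_address), followed by a restart of Cassandra:

  rpc_address: 192.168.111.136
  # or, to bind every interface:
  # rpc_address: 0.0.0.0
  # broadcast_rpc_address: 192.168.111.136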

On Sun, 7 Dec 2014 23:44 Richard Snowden richard.t.snow...@gmail.com
wrote:




re: UPDATE statement is failed

2014-12-07 Thread 鄢来琼
Hi All,
Here is a practice for the Cassandra UPDATE statement. It may not be the 
best, but it is a reference for updating a row at high frequency.

Cassandra's UPDATE will fail if the statement is executed more than once on 
the same row.
In the end, I changed the primary key so that Cassandra inserts a new row 
for each UPDATE, and then I delete all the redundant rows.
I also found that the UPDATE statement may fail if it immediately follows a 
DELETE statement.
A SELECT statement is used to check that the last UPDATE statement executed 
correctly.
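A possible explanation (an assumption on my part, not confirmed in this thread) is that back-to-back mutations on the same row can carry identical server-assigned timestamps, in which case last-write-wins resolution may keep the older value. A minimal sketch of one commonly suggested workaround - supplying explicit, strictly increasing client-side timestamps; the table, columns, and session are illustrative:

from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
import time

# Explicit microsecond timestamp: successive UPDATEs on the same row now
# carry distinct, increasing write times and cannot tie.
ts = int(time.time() * 1000000)
stmt = SimpleStatement(
    "UPDATE prices USING TIMESTAMP {} SET CLOSEPRICE = 8.160 "
    "WHERE SYMBOL = '600256' AND TRADETIME = '2014-11-27 10:00:00';".format(ts),
    consistency_level=ConsistencyLevel.ALL)
session.execute(stmt)  # session: an existing cassandra.cluster.Session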

Peter
From: 鄢来琼 [mailto:laiqiong@gtafe.com]
Sent: December 3, 2014 13:08
To: user@cassandra.apache.org
Subject: re: UPDATE statement is failed

The system settings are as follows:

Cluster replication:
replication = {'class': 'NetworkTopologyStrategy', 'GTA_SZ_DC1':2}

5 nodes in total; the OS of the nodes is Windows.



Thanks & Regards,
鄢来琼 / Peter YAN, Staff Software Engineer,
A3 Dept., GTA Information Technology Co., Ltd.
=
Mobile: 18620306659
E-Mail: laiqiong@gtafe.com
Website: http://www.gtafe.com/
=

From: 鄢来琼 [mailto:laiqiong@gtafe.com]
Sent: December 3, 2014 11:49
To: user@cassandra.apache.org
Subject: UPDATE statement is failed

Hi All,

There is a program that consumes messages from a queue; according to each 
message, the program will READ a row and then UPDATE it. BUT sometimes the 
UPDATE statement fails, and the result of the READ statement is still the 
old content from before the UPDATE.
Any suggestions from you are appreciated.
The following are my program and the result.

---READ
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Read the row back at ALL consistency to verify the preceding UPDATE.
self.interval_data_get_simple = SimpleStatement(
    "SELECT TRADETIME, OPENPRICE, HIGHPRICE, LOWPRICE, CLOSEPRICE, "
    "CHANGE, CHANGERATIO, VOLUME, AMOUNT, SECURITYNAME, SECURITYID "
    "FROM {} WHERE SYMBOL = '{}' AND TRADETIME = '{}';".format(
        self.cassandra_table, symbol,
        interval_trade_time.strftime(u'%Y-%m-%d %H:%M:%S')),
    consistency_level=ConsistencyLevel.ALL)

cur_interval_future = self.cassandra_session.execute_async(self.interval_data_get_simple)
---UPDATE
# (imports as in the READ snippet above)
# Write the new values at ALL consistency.
data_set_simple = SimpleStatement(
    "UPDATE {} SET OPENPRICE = {}, HIGHPRICE = {}, LOWPRICE = {}, "
    "CLOSEPRICE = {}, VOLUME = {}, AMOUNT = {}, MARKET = {}, SECURITYID = {} "
    "WHERE SYMBOL = '{}' AND TRADETIME = '{}';".format(
        self.cassandra_table, insert_data_list[0], insert_data_list[1],
        insert_data_list[2], insert_data_list[3], insert_data_list[4],
        insert_data_list[5], insert_data_list[6], insert_data_list[7],
        insert_data_list[8], insert_data_list[9]),
    consistency_level=ConsistencyLevel.ALL)

update_future = self.cassandra_session.execute(data_set_simple)


---test result
#CQL UPDATE statement
UPDATE GTA_HFDCS_SSEL2.SSEL2_TRDMIN01_20141127 SET OPENPRICE = 8.460, 
HIGHPRICE = 8.460, LOWPRICE = 8.460, CLOSEPRICE = 8.460, VOLUME = 1500, 
AMOUNT = 12240.000, MARKET = 1, SECURITYID = 20103592 
WHERE SYMBOL = '600256' AND TRADETIME = '2014-11-27 10:00:00';
#the result of READ
[Row(tradetime=datetime.datetime(2014, 11, 27, 2, 0), 
openprice=Decimal('8.460'), highprice=Decimal('8.460'), 
lowprice=Decimal('8.460'), closeprice=Decimal('8.460'), change=None, 
changeratio=None, volume=1500, amount=Decimal('12240.000'), 
securityname=None, securityid=20103592)]
#CQL UPDATE statement
UPDATE GTA_HFDCS_SSEL2.SSEL2_TRDMIN01_20141127 SET OPENPRICE = 8.460, 
HIGHPRICE = 8.460, LOWPRICE = 8.160, CLOSEPRICE = 8.160, VOLUME = 3500, 
AMOUNT = 28560.000, MARKET = 1, SECURITYID = 20103592 
WHERE SYMBOL = '600256' AND TRADETIME = '2014-11-27 10:00:00';
#the result of READ (still the old values from the first UPDATE)
[Row(tradetime=datetime.datetime(2014, 11, 27, 2, 0), 
openprice=Decimal('8.460'), highprice=Decimal('8.460'), 
lowprice=Decimal('8.460'), closeprice=Decimal('8.460'), change=None, 
changeratio=None, volume=1500, amount=Decimal('12240.000'), 
securityname=None, securityid=20103592)]



Thanks & Regards,
鄢来琼 / Peter YAN, Staff Software Engineer,
A3 Dept., GTA Information Technology Co., Ltd.
=
Mobile: 18620306659
E-Mail: laiqiong@gtafe.com
Website: http://www.gtafe.com/
=



Re: Could ring cache really improve performance in Cassandra?

2014-12-07 Thread Jonathan Haddad
I would really not recommend using thrift for anything at this point,
including your load tests.  Take a look at CQL; all development is going
there, and 2.1 has seen a massive performance boost over 2.0.

You may want to try the cassandra-stress tool included in 2.1; it can
stress a table you've already built.  That way you can rule out any bugs on
the client side.  If you're going to keep using your own tool, however, it
would be helpful if you sent out a link to the repo, since currently we
have no way of knowing whether you've got a client-side bug (data model or
code) that's limiting your performance.


On Sun Dec 07 2014 at 7:55:16 PM 孔嘉林 kongjiali...@gmail.com wrote:
