Re: Could ring cache really improve performance in Cassandra?

2014-12-07 Thread Jonathan Haddad
I would really not recommend using thrift for anything at this point,
including your load tests.  Take a look at CQL: all development is going
there, and 2.1 has seen a massive performance boost over 2.0.

You may want to try the Cassandra stress tool included in 2.1, which can
stress a table you've already built.  That way you can rule out any bugs on
the client side.  If you're going to keep using your own tool, however, it
would be helpful if you sent out a link to the repo, since currently we
have no way of knowing whether you've got a client-side bug (data model or
code) that's limiting your performance.


On Sun Dec 07 2014 at 7:55:16 PM 孔嘉林  wrote:

> I find that under the src/client folder of the Cassandra 2.1.0 source code
> there is a *RingCache.java* file. It uses a thrift client calling the
> *describe_ring()* API to get the token range of each Cassandra node. It is
> used on the client side. The client can use it combined with the
> partitioner to get the target node. In this way there is no need to route
> requests between Cassandra nodes, and the client can directly connect to
> the target node. So maybe it can save some routing time and improve
> performance.
> Thank you very much.
>
> 2014-12-08 1:28 GMT+08:00 Jonathan Haddad :
>
>> What's a ring cache?
>>
>> FYI if you're using the DataStax CQL drivers they will automatically
>> route requests to the correct node.
>>
>> On Sun Dec 07 2014 at 12:59:36 AM kong  wrote:
>>
>>> Hi,
>>>
>>> I'm doing a stress test on Cassandra. I learned that using a ring cache
>>> can improve performance because client requests can go directly to the
>>> target Cassandra server, so the coordinator node is the desired target
>>> node. In this way, there is no need for the coordinator to route client
>>> requests to the target node, and maybe we can get a linear performance
>>> increment.
>>>
>>>
>>>
>>> However, in my stress test on an Amazon EC2 cluster, the test results
>>> are weird. There seems to be no performance improvement after using the
>>> ring cache. Could anyone help me explain these results? (Also, I think the
>>> results of the test without the ring cache are weird, because there's no
>>> linear increase in QPS when new nodes are added. I need help explaining
>>> this, too.) The results are as follows:
>>>
>>>
>>>
>>> INSERT(write):
>>>
>>> Node count | Replication factor | QPS(No ring cache) | QPS(ring cache)
>>> 1          | 1                  | 18687              | 20195
>>> 2          | 1                  | 20793              | 26403
>>> 2          | 2                  | 22498              | 21263
>>> 4          | 1                  | 28348              | 30010
>>> 4          | 3                  | 28631              | 24413
>>>
>>> SELECT(read):
>>>
>>> Node count | Replication factor | QPS(No ring cache) | QPS(ring cache)
>>> 1          | 1                  | 24498              | 22802
>>> 2          | 1                  | 28219              | 27030
>>> 2          | 2                  | 35383              | 36674
>>> 4          | 1                  | 34648              | 28347
>>> 4          | 3                  | 52932              | 52590
>>>
>>>
>>>
>>>
>>>
>>> Thank you very much,
>>>
>>> Joy
>>>
>>
>


re: UPDATE statement is failed

2014-12-07 Thread 鄢来琼
Hi All,
Here is a practice for the Cassandra UPDATE statement. It may not be the best,
but it is a reference for updating a row at high frequency.

The UPDATE would fail when the statement was executed more than once on the
same row.
In the end, I changed the primary key so that Cassandra inserts a new row
instead of updating in place, and then I delete all the redundant rows.
I also found that an UPDATE statement may fail if it immediately follows a
DELETE statement.
A SELECT statement is used to check that the last UPDATE was executed
correctly.

Peter
From: 鄢来琼 [mailto:laiqiong@gtafe.com]
Sent: December 3, 2014 13:08
To: user@cassandra.apache.org
Subject: re: UPDATE statement is failed

The system settings are as follows:

Cluster replication:
replication = {'class': 'NetworkTopologyStrategy', 'GTA_SZ_DC1': 2}

5 nodes in total;
the OS on all nodes is Windows.



Thanks & Regards,
鄢来琼 / Peter YAN, Staff Software Engineer,
A3 Dept., GTA Information Technology Co., Ltd.
=
Mobile: 18620306659
E-Mail: laiqiong@gtafe.com
Website: http://www.gtafe.com/
=

From: 鄢来琼 [mailto:laiqiong@gtafe.com]
Sent: December 3, 2014 11:49
To: user@cassandra.apache.org
Subject: UPDATE statement is failed

Hi ALL,

There is a program that consumes messages from a queue; according to the
message, the program will READ a row and then UPDATE it.
But sometimes the UPDATE statement fails: the result of the READ statement is
still the old content from before the UPDATE.
Any suggestions are appreciated.
The following are my program and the results.

---READ
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

self.interval_data_get_simple = SimpleStatement(
    """SELECT TRADETIME, OPENPRICE, HIGHPRICE, LOWPRICE, CLOSEPRICE, CHANGE,
       CHANGERATIO, VOLUME, AMOUNT, SECURITYNAME, SECURITYID
       from {} WHERE SYMBOL = '{}' AND TRADETIME = '{}';""".format(
        self.cassandra_table, symbol,
        interval_trade_time.strftime(u'%Y-%m-%d %H:%M:%S')),
    consistency_level=ConsistencyLevel.ALL)

cur_interval_future = self.cassandra_session.execute_async(self.interval_data_get_simple)
-UPDATE
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

data_set_simple = SimpleStatement(
    """UPDATE {} SET OPENPRICE = {}, HIGHPRICE = {}, LOWPRICE = {},
       CLOSEPRICE = {}, VOLUME = {}, AMOUNT = {}, MARKET = {}, SECURITYID = {}
       WHERE SYMBOL = '{}' AND TRADETIME = '{}';""".format(
        self.cassandra_table, insert_data_list[0], insert_data_list[1],
        insert_data_list[2], insert_data_list[3], insert_data_list[4],
        insert_data_list[5], insert_data_list[6], insert_data_list[7],
        insert_data_list[8], insert_data_list[9]),
    consistency_level=ConsistencyLevel.ALL)

update_future = self.cassandra_session.execute(data_set_simple)
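
For reference, a hedged sketch of the same UPDATE written as a prepared
statement with bound values instead of string formatting, reusing
self.cassandra_session, self.cassandra_table and insert_data_list from the
code above (the table name still has to be formatted in, since it cannot be
a bind variable):

from cassandra import ConsistencyLevel

update_cql = ("UPDATE {} SET OPENPRICE = ?, HIGHPRICE = ?, LOWPRICE = ?, "
              "CLOSEPRICE = ?, VOLUME = ?, AMOUNT = ?, MARKET = ?, SECURITYID = ? "
              "WHERE SYMBOL = ? AND TRADETIME = ?").format(self.cassandra_table)
update_prepared = self.cassandra_session.prepare(update_cql)

# Bind the ten values in the same order as the original statement. If TRADETIME
# is a timestamp column, bind a datetime object rather than a pre-formatted string.
bound = update_prepared.bind(insert_data_list[:10])
bound.consistency_level = ConsistencyLevel.ALL
update_future = self.cassandra_session.execute(bound)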


test result--
#CQL UPDATE statement
UPDATE GTA_HFDCS_SSEL2.SSEL2_TRDMIN01_20141127 SET OPENPRICE = 8.460, HIGHPRICE 
= 8.460, LOWPRICE = 8.460, CLOSEPRICE = 8.460, VOLUME = 1500, 
AMOUNT = 12240.000, MARKET = 1, SECURITYID = 20103592 WHERE  
SYMBOL = '600256' AND TRADETIME = '2014-11-27 10:00:00';
#the result of READ
[Row(tradetime=datetime.datetime(2014, 11, 27, 2, 0), 
openprice=Decimal('8.460'), highprice=Decimal('8.460'), 
lowprice=Decimal('8.460'), closeprice=Decimal('8.460'), change=None, 
changeratio=None, volume=1500, amount=Decimal('12240.000'), securityname=None, 
securityid=20103592)]
#CQL UPDATE statement
UPDATE GTA_HFDCS_SSEL2.SSEL2_TRDMIN01_20141127 SET OPENPRICE = 8.460, HIGHPRICE 
= 8.460, LOWPRICE = 8.160, CLOSEPRICE = 8.160, VOLUME = 3500, 
AMOUNT = 28560.000, MARKET = 1, SECURITYID = 20103592 WHERE  
SYMBOL = '600256' AND TRADETIME = '2014-11-27 10:00:00';
#the result of READ
[Row(tradetime=datetime.datetime(2014, 11, 27, 2, 0), 
openprice=Decimal('8.460'), highprice=Decimal('8.460'), 
lowprice=Decimal('8.460'), closeprice=Decimal('8.460'), change=None, 
changeratio=None, volume=1500, amount=Decimal('12240.000'), securityname=None, 
securityid=20103592)]



Thanks & Regards,
鄢来琼 / Peter YAN, Staff Software Engineer,
A3 Dept., GTA Information Technology Co., Ltd.
=
Mobile: 18620306659
E-Mail: laiqiong@gtafe.com
Website: http://www.gtafe.com/
=



Re: Can not connect with cqlsh to something different than localhost

2014-12-07 Thread Michael Dykman
Try:
$ netstat -lnt
and see which interface port 9042 is listening on. You will likely need to
update cassandra.yaml to change the interface. By default, Cassandra is
listening on localhost so your local cqlsh session works.
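
For what it's worth, in 2.1 the native transport on port 9042 binds to the
rpc_address set in cassandra.yaml, so pointing rpc_address at the VM's
address (and restarting the node) is the usual fix. Below is a minimal
connectivity check with the DataStax Python driver, assuming it is
installed; the IP is the one from the original post:

from cassandra.cluster import Cluster

cluster = Cluster(contact_points=['192.168.111.136'], port=9042)
session = cluster.connect()
row = session.execute("SELECT release_version FROM system.local")[0]
print("Connected, Cassandra version:", row.release_version)
cluster.shutdown()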

On Sun, 7 Dec 2014 23:44 Richard Snowden 
wrote:

> I am running Cassandra 2.1.2 in an Ubuntu VM.
>
> "cqlsh" or "cqlsh localhost" works fine.
>
> But I can not connect from outside the VM (firewall, etc. disabled).
>
> Even when I do "cqlsh 192.168.111.136" in my VM I get connection refused.
> This is strange because when I check my network config I can see that
> 192.168.111.136 is my IP:
>
> root@ubuntu:~# ifconfig
>
> eth0  Link encap:Ethernet  HWaddr 00:0c:29:02:e0:de
>   inet addr:192.168.111.136  Bcast:192.168.111.255
> Mask:255.255.255.0
>   inet6 addr: fe80::20c:29ff:fe02:e0de/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:16042 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:8638 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:21307125 (21.3 MB)  TX bytes:709471 (709.4 KB)
>
> loLink encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:65536  Metric:1
>   RX packets:550 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:550 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:148053 (148.0 KB)  TX bytes:148053 (148.0 KB)
>
>
> root@ubuntu:~# cqlsh 192.168.111.136 9042
> Connection error: ('Unable to connect to any servers', {'192.168.111.136':
> error(111, "Tried connecting to [('192.168.111.136', 9042)]. Last error:
> Connection refused")})
>
>
> What to do?
>


Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Chris Lohfink
I think your client could use improvements.  How many threads do you have
running in your test?  With a thrift call like that you can only do one
request at a time per connection.  For example, assuming C* takes 0ms, a
10ms network latency/driver overhead will mean a 20ms RTT and a max
throughput of ~50 QPS per thread (the native binary protocol doesn't behave
like this).  Are you running the client on its own system or shared with a
node?  How are you load balancing your requests?  Source code would help
since there's a lot that can become a bottleneck.

Generally you will see a bit of a dip in latency going from N=RF=1 to N=2,
RF=2, etc., since there are optimizations on the coordinator node when it
doesn't need to send the request to the replicas.  The impact of the network
overhead decreases in significance as the cluster grows.  Typically, latency
wise, RF=N=1 is going to be the fastest possible for smaller loads (i.e. when
a client cannot fully saturate a single node).

The main thing to expect is that latency will plateau and remain fairly
constant as load/nodes increase, while throughput potential will increase
linearly (empirically, at least).

You should really attempt it with the native binary protocol + prepared
statements; running CQL over thrift is far from optimal.  I would recommend
using the cassandra-stress tool if you want to stress test Cassandra (and
not your code):
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
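
For reference, a rough sketch of driving load over the native protocol with
prepared statements and many concurrent in-flight requests, using the
DataStax Python driver; the contact point, keyspace, table and data sizes
below are placeholders, not something from the original posts:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['10.0.0.1'])              # contact point is an example
session = cluster.connect('stress_ks')       # placeholder keyspace

insert = session.prepare("INSERT INTO kv (id, payload) VALUES (?, ?)")

# 1,000 statements pipelined with up to 100 concurrent in-flight requests,
# instead of one blocking thrift call at a time per connection.
args = [(i, 'x' * 100) for i in range(1000)]
results = execute_concurrent_with_args(session, insert, args, concurrency=100)
print(sum(1 for ok, _ in results if ok), "writes succeeded")
cluster.shutdown()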

===
Chris Lohfink

On Sun, Dec 7, 2014 at 9:48 PM, 孔嘉林  wrote:

> Hi Eric,
> Thank you very much for your reply!
> Do you mean that I should clear my table after each run? Indeed, I can see
> several times of compaction during my test, but could only a few times
> compaction affect the performance that much? Also, I can see from the
> OpsCenter some ParNew GC happen but no CMS GC happen.
>
> I run my test on EC2 cluster, I think the network could be of high speed
> with in it. Each Cassandra server has 4 units CPU, 15 GiB memory and 80 SSD
> storage, which is of m3.xlarge type.
>
> As for latency, which latency should I care about most? p(99) or p(999)? I
> want to get the max QPS under a certain limited latency.
>
> I know my testing scenario are not the common case in production, I just
> want to know how much burden my cluster can bear under stress.
>
> So, how did you test your cluster that can get 86k writes/sec? How many
> requests did you send to your cluster? Was it also 1 million? Did you also
> use OpsCenter to monitor the real time performance? I also wonder why the
> write and read QPS OpsCenter provide are much lower than what I calculate.
> Could you please describe in detail about your test deployment?
>
> Thank you very much,
> Joy
>
> 2014-12-07 23:55 GMT+08:00 Eric Stevens :
>
>> Hi Joy,
>>
>> Are you resetting your data after each test run?  I wonder if your tests
>> are actually causing you to fall behind on data grooming tasks such as
>> compaction, and so performance suffers for your later tests.
>>
>> There are *so many* factors which can affect performance, without
>> reviewing test methodology in great detail, it's really hard to say whether
>> there are flaws which might uncover an antipattern cause atypical number of
>> cache hits or misses, and so forth. You may also be producing gc pressure
>> in the write path, and so forth.
>>
>> I *can* say that 28k writes per second looks just a little low, but it
>> depends a lot on your network, hardware, and write patterns (eg, data
>> size).  For a little performance test suite I wrote, with parallel batched
>> writes, on a 3 node rf=3 cluster test cluster, I got about 86k writes per
>> second.
>>
>> Also focusing exclusively on max latency is going to cause you some
>> troubles especially in the case of magnetic media as you're using.  Between
>> ill-timed GC and inconsistent performance characteristics from magnetic
>> media, your max numbers will often look significantly worse than your p(99)
>> or p(999) numbers.
>>
>> All this said, one node will often look better than several nodes for
>> certain patterns because it completely eliminates proxy (coordinator) write
>> times.  All writes are local writes.  It's an over-simple case that doesn't
>> reflect any practical production use of Cassandra, so it's probably not
>> worth even including in your tests.  I would recommend start at 3 nodes
>> rf=3, and compare against 6 nodes rf=6.  Make sure you're staying on top of
>> compaction and aren't seeing garbage collections in the logs (either of
>> those will be polluting your results with variability you can't account for
>> with small sample sizes of ~1 million).
>>
>> If you expect to sustain write volumes like this, you'll find these
>> clusters are sized too small (on that hardware you won't keep up with
>> compaction), and your tests are again testing scenarios you wouldn't
>> actually see in production.
>>
>> On Sat Dec 06 2014 at 7:09:18 AM kong  wrote:
>>
>>> Hi,
>>>
>>> I am doing stress test on Datastax Cassa

Can not connect with cqlsh to something different than localhost

2014-12-07 Thread Richard Snowden
I am running Cassandra 2.1.2 in an Ubuntu VM.

"cqlsh" or "cqlsh localhost" works fine.

But I can not connect from outside the VM (firewall, etc. disabled).

Even when I do "cqlsh 192.168.111.136" in my VM I get connection refused.
This is strange because when I check my network config I can see that
192.168.111.136 is my IP:

root@ubuntu:~# ifconfig

eth0  Link encap:Ethernet  HWaddr 00:0c:29:02:e0:de
  inet addr:192.168.111.136  Bcast:192.168.111.255
Mask:255.255.255.0
  inet6 addr: fe80::20c:29ff:fe02:e0de/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:16042 errors:0 dropped:0 overruns:0 frame:0
  TX packets:8638 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:21307125 (21.3 MB)  TX bytes:709471 (709.4 KB)

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:65536  Metric:1
  RX packets:550 errors:0 dropped:0 overruns:0 frame:0
  TX packets:550 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:148053 (148.0 KB)  TX bytes:148053 (148.0 KB)


root@ubuntu:~# cqlsh 192.168.111.136 9042
Connection error: ('Unable to connect to any servers', {'192.168.111.136':
error(111, "Tried connecting to [('192.168.111.136', 9042)]. Last error:
Connection refused")})


What to do?


Re: Could ring cache really improve performance in Cassandra?

2014-12-07 Thread 孔嘉林
I find that under the src/client folder of the Cassandra 2.1.0 source code
there is a *RingCache.java* file. It uses a thrift client calling the
*describe_ring()* API to get the token range of each Cassandra node. It is
used on the client side. The client can use it combined with the
partitioner to get the target node. In this way there is no need to route
requests between Cassandra nodes, and the client can directly connect to
the target node. So maybe it can save some routing time and improve
performance.
Thank you very much.

2014-12-08 1:28 GMT+08:00 Jonathan Haddad :

> What's a ring cache?
>
> FYI if you're using the DataStax CQL drivers they will automatically
> route requests to the correct node.
>
> On Sun Dec 07 2014 at 12:59:36 AM kong  wrote:
>
>> Hi,
>>
>> I'm doing a stress test on Cassandra. I learned that using a ring cache
>> can improve performance because client requests can go directly to the
>> target Cassandra server, so the coordinator node is the desired target
>> node. In this way, there is no need for the coordinator to route client
>> requests to the target node, and maybe we can get a linear performance
>> increment.
>>
>>
>>
>> However, in my stress test on an Amazon EC2 cluster, the test results
>> are weird. There seems to be no performance improvement after using the
>> ring cache. Could anyone help me explain these results? (Also, I think the
>> results of the test without the ring cache are weird, because there's no
>> linear increase in QPS when new nodes are added. I need help explaining
>> this, too.) The results are as follows:
>>
>>
>>
>> INSERT(write):
>>
>> Node count | Replication factor | QPS(No ring cache) | QPS(ring cache)
>> 1          | 1                  | 18687              | 20195
>> 2          | 1                  | 20793              | 26403
>> 2          | 2                  | 22498              | 21263
>> 4          | 1                  | 28348              | 30010
>> 4          | 3                  | 28631              | 24413
>>
>> SELECT(read):
>>
>> Node count | Replication factor | QPS(No ring cache) | QPS(ring cache)
>> 1          | 1                  | 24498              | 22802
>> 2          | 1                  | 28219              | 27030
>> 2          | 2                  | 35383              | 36674
>> 4          | 1                  | 34648              | 28347
>> 4          | 3                  | 52932              | 52590
>>
>>
>>
>>
>>
>> Thank you very much,
>>
>> Joy
>>
>


Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread 孔嘉林
Hi Eric,
Thank you very much for your reply!
Do you mean that I should clear my table after each run? Indeed, I can see
several compactions during my test, but could only a few compactions affect
the performance that much? Also, I can see from OpsCenter that some ParNew
GCs happen but no CMS GCs happen.

I run my test on an EC2 cluster, so I think the network within it should be
high speed. Each Cassandra server has 4 CPU units, 15 GiB memory and 80 GB of
SSD storage (the m3.xlarge type).

As for latency, which latency should I care about most, p(99) or p(999)? I
want to get the max QPS under a certain latency limit.

I know my testing scenario is not the common case in production; I just
want to know how much load my cluster can bear under stress.

So, how did you test your cluster to get 86k writes/sec? How many requests
did you send to your cluster? Was it also 1 million? Did you also use
OpsCenter to monitor the real-time performance? I also wonder why the write
and read QPS that OpsCenter reports are much lower than what I calculate.
Could you please describe your test deployment in detail?

Thank you very much,
Joy

2014-12-07 23:55 GMT+08:00 Eric Stevens :

> Hi Joy,
>
> Are you resetting your data after each test run?  I wonder if your tests
> are actually causing you to fall behind on data grooming tasks such as
> compaction, and so performance suffers for your later tests.
>
> There are *so many* factors which can affect performance, without
> reviewing test methodology in great detail, it's really hard to say whether
> there are flaws which might uncover an antipattern cause atypical number of
> cache hits or misses, and so forth. You may also be producing gc pressure
> in the write path, and so forth.
>
> I *can* say that 28k writes per second looks just a little low, but it
> depends a lot on your network, hardware, and write patterns (eg, data
> size).  For a little performance test suite I wrote, with parallel batched
> writes, on a 3 node rf=3 cluster test cluster, I got about 86k writes per
> second.
>
> Also focusing exclusively on max latency is going to cause you some
> troubles especially in the case of magnetic media as you're using.  Between
> ill-timed GC and inconsistent performance characteristics from magnetic
> media, your max numbers will often look significantly worse than your p(99)
> or p(999) numbers.
>
> All this said, one node will often look better than several nodes for
> certain patterns because it completely eliminates proxy (coordinator) write
> times.  All writes are local writes.  It's an over-simple case that doesn't
> reflect any practical production use of Cassandra, so it's probably not
> worth even including in your tests.  I would recommend start at 3 nodes
> rf=3, and compare against 6 nodes rf=6.  Make sure you're staying on top of
> compaction and aren't seeing garbage collections in the logs (either of
> those will be polluting your results with variability you can't account for
> with small sample sizes of ~1 million).
>
> If you expect to sustain write volumes like this, you'll find these
> clusters are sized too small (on that hardware you won't keep up with
> compaction), and your tests are again testing scenarios you wouldn't
> actually see in production.
>
> On Sat Dec 06 2014 at 7:09:18 AM kong  wrote:
>
>> Hi,
>>
>> I am doing stress test on Datastax Cassandra Community 2.1.2, not using
>> the provided stress test tool, but use my own stress-test client code
>> instead(I write some C++ stress test code). My Cassandra cluster is
>> deployed on Amazon EC2, using the provided Datastax Community AMI( HVM
>> instances ) in the Datastax document, and I am not using EBS, just using
>> the ephemeral storage by default. The EC2 type of Cassandra servers are
>> m3.xlarge. I use another EC2 instance for my stress test client, which is
>> of type r3.8xlarge. Both the Cassandra sever nodes and stress test client
>> node are in us-east. I test the Cassandra cluster which is made up of 1
>> node, 2 nodes, and 4 nodes separately. I just do INSERT test and SELECT
>> test separately, but the performance doesn’t get linear increment when new
>> nodes are added. Also I get some weird results. My test results are as
>> follows(*I do 1 million operations and I try to get the best QPS when
>> the max latency is no more than 200ms, and the latencies are measured from
>> the client side. The QPS is calculated by total_operations/total_time).*
>>
>>
>>
>> *INSERT(write):*
>>
>> Node count | Replication factor | QPS | Average latency(ms) | Min latency(ms) | .95 latency(ms) | .99 latency(ms) | .999 latency(ms) | Max latency(ms)
>> 1 | 1 | 18687 | 2.08 | 1.48 | 2.95 | 5.74 | 52.8 | 205.4
>> 2 | 1 | 20793 | 3.15 | 0.84 | 7.71 | 41.35 | 88.7 | 232.7
>> 2 | 2 | 22498 | 3.37 | 0.86 | 6.04 | 36.1 | 221.5 | 649.3
>>
>> 4
>>
>

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jonathan Haddad
I think he mentioned 100 MB as the max size - planning for 1 MB might make
your data model difficult to work with.

On Sun Dec 07 2014 at 12:07:47 PM Kai Wang  wrote:

> Thanks for the help. I wasn't clear how clustering column works. Coming
> from Thrift experience, it took me a while to understand how clustering
> column impacts partition storage on disk. Now I believe using seq_type as
> the first clustering column solves my problem. As of partition size, I will
> start with some bucket assumption. If the partition size exceeds the
> threshold I may need to re-bucket using smaller bucket size.
>
> On another thread Eric mentions the optimal partition size should be at
> 100 kb ~ 1 MB. I will use that as the start point to design my bucket
> strategy.
>
>
> On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky 
> wrote:
>
>>   It would be helpful to look at some specific examples of sequences,
>> showing how they grow. I suspect that the term “sequence” is being
>> overloaded in some subtly misleading way here.
>>
>> Besides, we’ve already answered the headline question – data locality is
>> achieved by having a common partition key. So, we need some clarity as to
>> what question we are really focusing on
>>
>> And, of course, we should be asking the “Cassandra Data Modeling 101”
>> question of what do your queries want to look like, how exactly do you want
>> to access your data. Only after we have a handle on how you need to read
>> your data can we decide how it should be stored.
>>
>> My immediate question to get things back on track: When you say “The
>> typical read is to load a subset of sequences with the same seq_id”,
>> what type of “subset” are you talking about? Again, a few explicit and
>> concise example queries (in some concise, easy to read pseudo language or
>> even plain English, but not belabored with full CQL syntax.) would be very
>> helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
>> command, so what are we really talking about?
>>
>> Also, I presume we are talking CQL, but some of the references seem more
>> Thrift/slice oriented.
>>
>> -- Jack Krupansky
>>
>>  *From:* Eric Stevens 
>> *Sent:* Sunday, December 7, 2014 10:12 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: How to model data to achieve specific data locality
>>
>> > Also new seq_types can be added and old seq_types can be deleted. This
>> means I often need to ALTER TABLE to add and drop columns.
>>
>> Kai, unless I'm misunderstanding something, I don't see why you need to
>> alter the table to add a new seq type.  From a data model perspective,
>> these are just new values in a row.
>>
>> If you do have columns which are specific to particular seq_types, data
>> modeling does become a little more challenging.  In that case you may get
>> some advantage from using collections (especially map) to store data which
>> applies to only a few seq types.  Or defining a schema which includes the
>> set of all possible columns (that's when you're getting into ALTERs when a
>> new column comes or goes).
>>
>> > All sequences with the same seq_id tend to grow at the same rate.
>>
>> Note that it is an anti pattern in Cassandra to append to the same row
>> indefinitely.  I think you understand this because of your original
>> question.  But please note that a sub partitioning strategy which reuses
>> subpartitions will result in degraded read performance after a while.
>> You'll need to rotate sub partitions by something that doesn't repeat in
>> order to keep the data for a given partition key grouped into just a few
>> sstables.  A typical pattern there is to use some kind of time bucket
>> (hour, day, week, etc., depending on your write volume).
>>
>> I do note that your original question was about preserving data locality
>> - and having a consistent locality for a given seq_id - for best offline
>> analytics.  If you wanted to work for this, you can certainly also include
>> a blob value in your partitioning key, whose value is calculated to force a
>> ring collision with this record's sibling data.  With Cassandra's default
>> partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
>> designed to be cryptographically strong (it doesn't work to make it
>> difficult to force a collision), but it's meant to have good distribution
>> (it may still be computationally expensive to force a collision - I'm not
>> that familiar with its internal workings).  In this case,
>> ByteOrderedPartitioner would be a lot easier to force a ring collision on,
>> but then you need to work on a good ring balancing strategy to distribute
>> your data evenly over the ring.
>>
>> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan 
>> wrote:
>>
>>> "Those sequences are not fixed. All sequences with the same seq_id tend
>>> to grow at the same rate. If it's one partition per seq_id, the size will
>>> most likely exceed the threshold quickly"
>>>
>>>  --> Then use bucketing to avoid too wide partitions
>>>
>>> "Also new 

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jack Krupansky
As a general rule, partitions can certainly be much larger than 1 MB, even up 
to 100 MB. 5 MB to 10 MB might be a good target size.

Originally you stated that the number of seq_types could be “unlimited”... is 
that really true? Is there no practical upper limit you can establish, like 
10,000 or 10 million or...? Sure, buckets are a very real option, but if the 
number of seq_types was only 10,000 to 50,000, then bucketing might be 
unnecessary complexity and access overhead.

-- Jack Krupansky

From: Kai Wang 
Sent: Sunday, December 7, 2014 3:06 PM
To: user@cassandra.apache.org 
Subject: Re: How to model data to achieve specific data locality

Thanks for the help. I wasn't clear how clustering column works. Coming from 
Thrift experience, it took me a while to understand how clustering column 
impacts partition storage on disk. Now I believe using seq_type as the first 
clustering column solves my problem. As of partition size, I will start with 
some bucket assumption. If the partition size exceeds the threshold I may need 
to re-bucket using smaller bucket size.


On another thread Eric mentions the optimal partition size should be at 100 kb 
~ 1 MB. I will use that as the start point to design my bucket strategy.



On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky  wrote:

  It would be helpful to look at some specific examples of sequences, showing 
how they grow. I suspect that the term “sequence” is being overloaded in some 
subtly misleading way here.

  Besides, we’ve already answered the headline question – data locality is 
achieved by having a common partition key. So, we need some clarity as to what 
question we are really focusing on

  And, of course, we should be asking the “Cassandra Data Modeling 101” 
question of what do your queries want to look like, how exactly do you want to 
access your data. Only after we have a handle on how you need to read your data 
can we decide how it should be stored.

  My immediate question to get things back on track: When you say “The typical 
read is to load a subset of sequences with the same seq_id”, what type of 
“subset” are you talking about? Again, a few explicit and concise example 
queries (in some concise, easy to read pseudo language or even plain English, 
but not belabored with full CQL syntax.) would be very helpful. I mean, 
Cassandra has no “subset” concept, nor a “load subset” command, so what are we 
really talking about?

  Also, I presume we are talking CQL, but some of the references seem more 
Thrift/slice oriented.

  -- Jack Krupansky

  From: Eric Stevens 
  Sent: Sunday, December 7, 2014 10:12 AM
  To: user@cassandra.apache.org 
  Subject: Re: How to model data to achieve specific data locality

  > Also new seq_types can be added and old seq_types can be deleted. This 
means I often need to ALTER TABLE to add and drop columns. 

  Kai, unless I'm misunderstanding something, I don't see why you need to alter 
the table to add a new seq type.  From a data model perspective, these are just 
new values in a row.  

  If you do have columns which are specific to particular seq_types, data 
modeling does become a little more challenging.  In that case you may get some 
advantage from using collections (especially map) to store data which applies 
to only a few seq types.  Or defining a schema which includes the set of all 
possible columns (that's when you're getting into ALTERs when a new column 
comes or goes).

  > All sequences with the same seq_id tend to grow at the same rate.


  Note that it is an anti pattern in Cassandra to append to the same row 
indefinitely.  I think you understand this because of your original question.  
But please note that a sub partitioning strategy which reuses subpartitions 
will result in degraded read performance after a while.  You'll need to rotate 
sub partitions by something that doesn't repeat in order to keep the data for a 
given partition key grouped into just a few sstables.  A typical pattern there 
is to use some kind of time bucket (hour, day, week, etc., depending on your 
write volume).


  I do note that your original question was about preserving data locality - 
and having a consistent locality for a given seq_id - for best offline 
analytics.  If you wanted to work for this, you can certainly also include a 
blob value in your partitioning key, whose value is calculated to force a ring 
collision with this record's sibling data.  With Cassandra's default 
partitioner of murmur3, that's probably pretty challenging - murmur3 isn't 
designed to be cryptographically strong (it doesn't work to make it difficult 
to force a collision), but it's meant to have good distribution (it may still 
be computationally expensive to force a collision - I'm not that familiar with 
its internal workings).  In this case, ByteOrderedPartitioner would be a lot 
easier to force a ring collision on, but then you need to work on a good ring 
balancing strategy to distribute your data evenly over the 

Re: How to model data to achieve specific data locality

2014-12-07 Thread Kai Wang
Thanks for the help. I wasn't clear on how clustering columns work. Coming
from a Thrift background, it took me a while to understand how a clustering
column impacts partition storage on disk. Now I believe using seq_type as
the first clustering column solves my problem. As for partition size, I will
start with some bucket assumption. If the partition size exceeds the
threshold I may need to re-bucket using a smaller bucket size.

On another thread Eric mentions the optimal partition size should be about
100 KB ~ 1 MB. I will use that as the starting point to design my bucket
strategy.
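
For reference, a sketch of the kind of bucketed schema discussed in this
thread, executed through the DataStax Python driver; the keyspace, column
names and bucket scheme are assumptions used only to illustrate
sub-partitioning a seq_id:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo_ks')         # placeholder keyspace

# The bucket could be a time bucket or hash(seq_type) % N; it just keeps any
# single partition from growing without bound.
session.execute("""
    CREATE TABLE IF NOT EXISTS sequences (
        seq_id   text,
        bucket   int,
        seq_type text,
        pos      bigint,
        value    blob,
        PRIMARY KEY ((seq_id, bucket), seq_type, pos)
    )""")

# All rows for one (seq_id, bucket) live on the same replicas and are stored
# together on disk, ordered by seq_type and then pos.
insert = session.prepare(
    "INSERT INTO sequences (seq_id, bucket, seq_type, pos, value) VALUES (?, ?, ?, ?, ?)")
session.execute(insert, ('seq-001', 3, 'typeA', 1, b'\x00\x01'))
cluster.shutdown()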


On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky 
wrote:

>   It would be helpful to look at some specific examples of sequences,
> showing how they grow. I suspect that the term “sequence” is being
> overloaded in some subtly misleading way here.
>
> Besides, we’ve already answered the headline question – data locality is
> achieved by having a common partition key. So, we need some clarity as to
> what question we are really focusing on
>
> And, of course, we should be asking the “Cassandra Data Modeling 101”
> question of what do your queries want to look like, how exactly do you want
> to access your data. Only after we have a handle on how you need to read
> your data can we decide how it should be stored.
>
> My immediate question to get things back on track: When you say “The
> typical read is to load a subset of sequences with the same seq_id”, what
> type of “subset” are you talking about? Again, a few explicit and concise
> example queries (in some concise, easy to read pseudo language or even
> plain English, but not belabored with full CQL syntax.) would be very
> helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
> command, so what are we really talking about?
>
> Also, I presume we are talking CQL, but some of the references seem more
> Thrift/slice oriented.
>
> -- Jack Krupansky
>
>  *From:* Eric Stevens 
> *Sent:* Sunday, December 7, 2014 10:12 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: How to model data to achieve specific data locality
>
> > Also new seq_types can be added and old seq_types can be deleted. This
> means I often need to ALTER TABLE to add and drop columns.
>
> Kai, unless I'm misunderstanding something, I don't see why you need to
> alter the table to add a new seq type.  From a data model perspective,
> these are just new values in a row.
>
> If you do have columns which are specific to particular seq_types, data
> modeling does become a little more challenging.  In that case you may get
> some advantage from using collections (especially map) to store data which
> applies to only a few seq types.  Or defining a schema which includes the
> set of all possible columns (that's when you're getting into ALTERs when a
> new column comes or goes).
>
> > All sequences with the same seq_id tend to grow at the same rate.
>
> Note that it is an anti pattern in Cassandra to append to the same row
> indefinitely.  I think you understand this because of your original
> question.  But please note that a sub partitioning strategy which reuses
> subpartitions will result in degraded read performance after a while.
> You'll need to rotate sub partitions by something that doesn't repeat in
> order to keep the data for a given partition key grouped into just a few
> sstables.  A typical pattern there is to use some kind of time bucket
> (hour, day, week, etc., depending on your write volume).
>
> I do note that your original question was about preserving data locality -
> and having a consistent locality for a given seq_id - for best offline
> analytics.  If you wanted to work for this, you can certainly also include
> a blob value in your partitioning key, whose value is calculated to force a
> ring collision with this record's sibling data.  With Cassandra's default
> partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
> designed to be cryptographically strong (it doesn't work to make it
> difficult to force a collision), but it's meant to have good distribution
> (it may still be computationally expensive to force a collision - I'm not
> that familiar with its internal workings).  In this case,
> ByteOrderedPartitioner would be a lot easier to force a ring collision on,
> but then you need to work on a good ring balancing strategy to distribute
> your data evenly over the ring.
>
> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan  wrote:
>
>> "Those sequences are not fixed. All sequences with the same seq_id tend
>> to grow at the same rate. If it's one partition per seq_id, the size will
>> most likely exceed the threshold quickly"
>>
>>  --> Then use bucketing to avoid too wide partitions
>>
>> "Also new seq_types can be added and old seq_types can be deleted. This
>> means I often need to ALTER TABLE to add and drop columns. I am not sure if
>> this is a good practice from operation point of view."
>>
>>  --> I don't understand why altering table is necessary to add
>

Re: Recommissioned node is much smaller

2014-12-07 Thread Y.Wong
On Dec 2, 2014 3:45 PM, "Robert Coli"  wrote:

> On Tue, Dec 2, 2014 at 12:21 PM, Robert Wille  wrote:
>
>> As a a test, I took down a node, deleted /var/lib/cassandra and restarted
>> it. After it joined the cluster, it’s about 75% the size of its neighbors
>> (both in terms of bytes and numbers of keys). Prior to my test it was
>> approximately the same size. I have no explanation for why that node would
>> shrink so much, other than data loss. I have no deleted data, and no TTL’s.
>> Only a small percentage of my data has had any updates (and some of my
>> tables have had only inserts, and those have shrunk by 25% as well). I
>> don’t really know how to check if I have records that have fewer than three
>> replicas (RF=3).
>>
>
> Sounds suspicious, actually. I would suspect "partial-bootstrap."
>
> To determine if you have under-replicated data, run repair. That's what
> it's for.
>
> =Rob
>
>


Re: full gc too often

2014-12-07 Thread Jonathan Haddad
There's a lot of factors that go into tuning, and I don't know of any
reliable formula that you can use to figure out what's going to work
optimally for your hardware.  Personally I recommend:

1) find the bottleneck
2) play with a parameter (or two)
3) see what changed, performance wise

If you've got a specific question I think someone can find a way to help,
but asking "what can 8gb of heap give me" is pretty abstract and
unanswerable.

Jon

On Sun Dec 07 2014 at 8:03:53 AM Philo Yang  wrote:

> 2014-12-05 15:40 GMT+08:00 Jonathan Haddad :
>
>> I recommend reading through https://issues.apache.
>> org/jira/browse/CASSANDRA-8150 to get an idea of how the JVM GC works
>> and what you can do to tune it.  Also good is Blake Eggleston's writeup
>> which can be found here: http://blakeeggleston.com/
>> cassandra-tuning-the-jvm-for-read-heavy-workloads.html
>>
>> I'd like to note that allocating 4GB heap to Cassandra under any serious
>> workload is unlikely to be sufficient.
>>
>>
> Thanks for your recommendation. After reading I try to allocate a larger
> heap and it is useful for me. 4G heap can't handle the workload in my use
> case indeed.
>
> So another question is, how much pressure dose default max heap (8G) can
> handle? The "pressure" may not be a simple qps, you know, slice query for
> many columns in a row will allocate more objects in heap than the query for
> a single column. Is there any testing result for the relationship between
> the "pressure" and the "safety" heap size? We know query a slice with many
> tombstones is not a good use case, but query a slice without tombstones may
> be a common use case, right?
>
>
>
>>
>> On Thu Dec 04 2014 at 8:43:38 PM Philo Yang  wrote:
>>
>>> I have two kinds of machine:
>>> 16G RAM, with default heap size setting, about 4G.
>>> 64G RAM, with default heap size setting, about 8G.
>>>
>>> These two kinds of nodes have same number of vnodes, and both of them
>>> have gc issue, although the node of 16G have a higher probability  of gc
>>> issue.
>>>
>>> Thanks,
>>> Philo Yang
>>>
>>>
>>> 2014-12-05 12:34 GMT+08:00 Tim Heckman :
>>>
 On Dec 4, 2014 8:14 PM, "Philo Yang"  wrote:
 >
 > Hi,all
 >
 > I have a cluster on C* 2.1.1 and jdk 1.7_u51. I have a trouble with
 full gc that sometime there may be one or two nodes full gc more than one
 time per minute and over 10 seconds each time, then the node will be
 unreachable and the latency of cluster will be increased.
 >
 > I grep the GCInspector's log, I found when the node is running fine
 without gc trouble there are two kinds of gc:
 > ParNew GC in less than 300ms which clear the Par Eden Space and
 enlarge CMS Old Gen/ Par Survivor Space little (because it only show gc in
 more than 200ms, there is only a small number of ParNew GC in log)
 > ConcurrentMarkSweep in 4000~8000ms which reduce CMS Old Gen much and
 enlarge Par Eden Space little, each 1-2 hours it will be executed once.
 >
 > However, sometimes ConcurrentMarkSweep will be strange like it shows:
 >
 > INFO  [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 12648ms.  CMS Old Gen: 3579838424 ->
 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor
 Space: 62914528 -> 0
 > INFO  [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 12227ms.  CMS Old Gen: 3579838464 ->
 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor
 Space: 62872496 -> 0
 > INFO  [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 11538ms.  CMS Old Gen: 3579836688 ->
 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor
 Space: 62914544 -> 0
 > INFO  [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 12180ms.  CMS Old Gen: 3579835784 ->
 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor
 Space: 62914552 -> 0
 > INFO  [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 10574ms.  CMS Old Gen: 3579838112 ->
 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor
 Space: 62914560 -> 0
 > INFO  [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 11594ms.  CMS Old Gen: 3579831424 ->
 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor
 Space: 62914552 -> 0
 > INFO  [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 11463ms.  CMS Old Gen: 3579817392 ->
 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor
 Space: 62896720 -> 0
 > INFO  [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 -
 ConcurrentMarkSweep GC in 9576ms.  CMS Old Gen: 3579838424 ->
 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor
 Space

Re: Could ring cache really improve performance in Cassandra?

2014-12-07 Thread Jonathan Haddad
What's a ring cache?

FYI if you're using the DataStax CQL drivers they will automatically route
requests to the correct node.
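
For reference, a minimal sketch of token-aware routing with the DataStax
Python driver, which gives the same benefit the thrift RingCache is after;
the contact point, keyspace and table names are placeholders:

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

cluster = Cluster(
    contact_points=['127.0.0.1'],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))
session = cluster.connect('my_keyspace')

# With a prepared statement the driver knows the partition key, so each
# request is sent straight to a replica that owns the row.
insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")
session.execute(insert, (42, 'hello'))
cluster.shutdown()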

On Sun Dec 07 2014 at 12:59:36 AM kong  wrote:

> Hi,
>
> I'm doing a stress test on Cassandra. I learned that using a ring cache
> can improve performance because client requests can go directly to the
> target Cassandra server, so the coordinator node is the desired target
> node. In this way, there is no need for the coordinator to route client
> requests to the target node, and maybe we can get a linear performance
> increment.
>
>
>
> However, in my stress test on an Amazon EC2 cluster, the test results
> are weird. There seems to be no performance improvement after using the
> ring cache. Could anyone help me explain these results? (Also, I think the
> results of the test without the ring cache are weird, because there's no
> linear increase in QPS when new nodes are added. I need help explaining
> this, too.) The results are as follows:
>
>
>
> INSERT(write):
>
> Node count | Replication factor | QPS(No ring cache) | QPS(ring cache)
> 1          | 1                  | 18687              | 20195
> 2          | 1                  | 20793              | 26403
> 2          | 2                  | 22498              | 21263
> 4          | 1                  | 28348              | 30010
> 4          | 3                  | 28631              | 24413
>
> SELECT(read):
>
> Node count | Replication factor | QPS(No ring cache) | QPS(ring cache)
> 1          | 1                  | 24498              | 22802
> 2          | 1                  | 28219              | 27030
> 2          | 2                  | 35383              | 36674
> 4          | 1                  | 34648              | 28347
> 4          | 3                  | 52932              | 52590
>
>
>
>
>
> Thank you very much,
>
> Joy
>


Re: full gc too often

2014-12-07 Thread Philo Yang
2014-12-05 15:40 GMT+08:00 Jonathan Haddad :

> I recommend reading through
> https://issues.apache.org/jira/browse/CASSANDRA-8150 to get an idea of
> how the JVM GC works and what you can do to tune it.  Also good is Blake
> Eggleston's writeup which can be found here:
> http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html
>
> I'd like to note that allocating 4GB heap to Cassandra under any serious
> workload is unlikely to be sufficient.
>
>
Thanks for your recommendation. After reading, I tried allocating a larger
heap and it helped; a 4G heap indeed can't handle the workload in my use
case.

So another question is, how much pressure can the default max heap (8G)
handle? The "pressure" may not be a simple QPS figure; you know, a slice
query over many columns in a row will allocate more objects on the heap than
a query for a single column. Is there any testing result for the relationship
between the "pressure" and the "safe" heap size? We know querying a slice
with many tombstones is not a good use case, but querying a slice without
tombstones may be a common use case, right?



>
> On Thu Dec 04 2014 at 8:43:38 PM Philo Yang  wrote:
>
>> I have two kinds of machine:
>> 16G RAM, with default heap size setting, about 4G.
>> 64G RAM, with default heap size setting, about 8G.
>>
>> These two kinds of nodes have same number of vnodes, and both of them
>> have gc issue, although the node of 16G have a higher probability  of gc
>> issue.
>>
>> Thanks,
>> Philo Yang
>>
>>
>> 2014-12-05 12:34 GMT+08:00 Tim Heckman :
>>
>>> On Dec 4, 2014 8:14 PM, "Philo Yang"  wrote:
>>> >
>>> > Hi,all
>>> >
>>> > I have a cluster on C* 2.1.1 and jdk 1.7_u51. I have a trouble with
>>> full gc that sometime there may be one or two nodes full gc more than one
>>> time per minute and over 10 seconds each time, then the node will be
>>> unreachable and the latency of cluster will be increased.
>>> >
>>> > I grep the GCInspector's log, I found when the node is running fine
>>> without gc trouble there are two kinds of gc:
>>> > ParNew GC in less than 300ms which clear the Par Eden Space and
>>> enlarge CMS Old Gen/ Par Survivor Space little (because it only show gc in
>>> more than 200ms, there is only a small number of ParNew GC in log)
>>> > ConcurrentMarkSweep in 4000~8000ms which reduce CMS Old Gen much and
>>> enlarge Par Eden Space little, each 1-2 hours it will be executed once.
>>> >
>>> > However, sometimes ConcurrentMarkSweep will be strange like it shows:
>>> >
>>> > INFO  [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 12648ms.  CMS Old Gen: 3579838424 ->
>>> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space:
>>> 62914528 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 12227ms.  CMS Old Gen: 3579838464 ->
>>> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space:
>>> 62872496 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 11538ms.  CMS Old Gen: 3579836688 ->
>>> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space:
>>> 62914544 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 12180ms.  CMS Old Gen: 3579835784 ->
>>> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space:
>>> 62914552 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 10574ms.  CMS Old Gen: 3579838112 ->
>>> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space:
>>> 62914560 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 11594ms.  CMS Old Gen: 3579831424 ->
>>> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space:
>>> 62914552 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 11463ms.  CMS Old Gen: 3579817392 ->
>>> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space:
>>> 62896720 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 9576ms.  CMS Old Gen: 3579838424 -> 3579816424;
>>> Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 11556ms.  CMS Old Gen: 3579816424 ->
>>> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space:
>>> 62889528 -> 0
>>> > INFO  [Service Thread] 2014-12-05 11:30:54,085 GCInspector.java:142 -
>>> ConcurrentMarkSweep GC in 12082ms.  CMS Old Gen: 3579786592 ->
>>> 3579814464; Par Eden Space: 503316480 -> 448782440; Par Survivor Space:
>>> 62914560 -> 0
>>> >
>>> > In each time Old Gen reduce only a little, Survivor Space will be
>>> clear but the heap is still full so there wi

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
I'm sorry, I meant to say "6 nodes rf=3".

Also look at this performance over sustained periods of times, not burst
writing.  Run your test for several hours and watch memory and especially
compaction stats.  See if you can walk in what data volume you can write
while keeping outstanding compaction tasks < 5 (preferably 0 or 1) for
sustained periods.  Measuring just burst writes will definitely mask real
world conditions, and Cassandra actually absorbs bursted writes really well
(which in turn masks performance problems since by the time your write
times suffer from overwhelming a cluster, you're probably already in insane
and difficult to recover crisis mode).

On Sun Dec 07 2014 at 8:55:47 AM Eric Stevens  wrote:

> Hi Joy,
>
> Are you resetting your data after each test run?  I wonder if your tests
> are actually causing you to fall behind on data grooming tasks such as
> compaction, and so performance suffers for your later tests.
>
> There are *so many* factors which can affect performance, without
> reviewing test methodology in great detail, it's really hard to say whether
> there are flaws which might uncover an antipattern cause atypical number of
> cache hits or misses, and so forth. You may also be producing gc pressure
> in the write path, and so forth.
>
> I *can* say that 28k writes per second looks just a little low, but it
> depends a lot on your network, hardware, and write patterns (eg, data
> size).  For a little performance test suite I wrote, with parallel batched
> writes, on a 3 node rf=3 cluster test cluster, I got about 86k writes per
> second.
>
> Also focusing exclusively on max latency is going to cause you some
> troubles especially in the case of magnetic media as you're using.  Between
> ill-timed GC and inconsistent performance characteristics from magnetic
> media, your max numbers will often look significantly worse than your p(99)
> or p(999) numbers.
>
> All this said, one node will often look better than several nodes for
> certain patterns because it completely eliminates proxy (coordinator) write
> times.  All writes are local writes.  It's an over-simple case that doesn't
> reflect any practical production use of Cassandra, so it's probably not
> worth even including in your tests.  I would recommend start at 3 nodes
> rf=3, and compare against 6 nodes rf=6.  Make sure you're staying on top of
> compaction and aren't seeing garbage collections in the logs (either of
> those will be polluting your results with variability you can't account for
> with small sample sizes of ~1 million).
>
> If you expect to sustain write volumes like this, you'll find these
> clusters are sized too small (on that hardware you won't keep up with
> compaction), and your tests are again testing scenarios you wouldn't
> actually see in production.
>
> On Sat Dec 06 2014 at 7:09:18 AM kong  wrote:
>
>> Hi,
>>
>> I am doing stress test on Datastax Cassandra Community 2.1.2, not using
>> the provided stress test tool, but use my own stress-test client code
>> instead(I write some C++ stress test code). My Cassandra cluster is
>> deployed on Amazon EC2, using the provided Datastax Community AMI( HVM
>> instances ) in the Datastax document, and I am not using EBS, just using
>> the ephemeral storage by default. The EC2 type of Cassandra servers are
>> m3.xlarge. I use another EC2 instance for my stress test client, which is
>> of type r3.8xlarge. Both the Cassandra sever nodes and stress test client
>> node are in us-east. I test the Cassandra cluster which is made up of 1
>> node, 2 nodes, and 4 nodes separately. I just do INSERT test and SELECT
>> test separately, but the performance doesn’t get linear increment when new
>> nodes are added. Also I get some weird results. My test results are as
>> follows(*I do 1 million operations and I try to get the best QPS when
>> the max latency is no more than 200ms, and the latencies are measured from
>> the client side. The QPS is calculated by total_operations/total_time).*
>>
>>
>>
>> *INSERT(write):*
>>
>> Node count | Replication factor | QPS | Average latency(ms) | Min latency(ms) | .95 latency(ms) | .99 latency(ms) | .999 latency(ms) | Max latency(ms)
>> 1 | 1 | 18687 | 2.08 | 1.48 | 2.95 | 5.74 | 52.8 | 205.4
>> 2 | 1 | 20793 | 3.15 | 0.84 | 7.71 | 41.35 | 88.7 | 232.7
>> 2 | 2 | 22498 | 3.37 | 0.86 | 6.04 | 36.1 | 221.5 | 649.3
>> 4 | 1 | 28348 | 4.38 | 0.85 | 8.19 | 64.51 | 169.4 | 251.9
>> 4 | 3 | 28631 | 5.22 | 0.87 | 18.68 | 68.35 | 167.2 | 288
>>
>> *SELECT(read):*
>>
>> Node count | Replication factor | QPS | Average latency(ms) | Min latency(ms) | .95 latency(ms) | .99 latency(ms) | .999 latency(ms) | Max latency(ms)
>> 1 | 1 | 24498 | 4.01 | 1.51 | 7.6 | 12.51
>>
>> 3

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
Hi Joy,

Are you resetting your data after each test run?  I wonder if your tests
are actually causing you to fall behind on data grooming tasks such as
compaction, and so performance suffers for your later tests.

There are *so many* factors which can affect performance that, without
reviewing the test methodology in great detail, it's really hard to say
whether there are flaws which might uncover an antipattern, cause an atypical
number of cache hits or misses, and so forth. You may also be producing GC
pressure in the write path, and so forth.

I *can* say that 28k writes per second looks just a little low, but it
depends a lot on your network, hardware, and write patterns (eg, data
size).  For a little performance test suite I wrote, with parallel batched
writes, on a 3-node rf=3 test cluster, I got about 86k writes per
second.

Also, focusing exclusively on max latency is going to cause you some
trouble, especially with magnetic media such as you're using.  Between
ill-timed GC and inconsistent performance characteristics from magnetic
media, your max numbers will often look significantly worse than your p(99)
or p(999) numbers.

All this said, one node will often look better than several nodes for
certain patterns because it completely eliminates proxy (coordinator) write
times.  All writes are local writes.  It's an over-simple case that doesn't
reflect any practical production use of Cassandra, so it's probably not
worth even including in your tests.  I would recommend starting at 3 nodes
rf=3, and compare against 6 nodes rf=6.  Make sure you're staying on top of
compaction and aren't seeing garbage collections in the logs (either of
those will be polluting your results with variability you can't account for
with small sample sizes of ~1 million).
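
For example, to spot GC pauses in the logs between runs (assuming a stock
package install; the log path may differ on your AMI):

  grep GCInspector /var/log/cassandra/system.log | tail -n 20

Pauses of a few hundred milliseconds logged during a run will dominate your
max latency numbers all by themselves.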

If you expect to sustain write volumes like this, you'll find these
clusters are sized too small (on that hardware you won't keep up with
compaction), and your tests are again testing scenarios you wouldn't
actually see in production.

On Sat Dec 06 2014 at 7:09:18 AM kong  wrote:

> Hi,
>
> I am doing stress test on Datastax Cassandra Community 2.1.2, not using
> the provided stress test tool, but using my own stress-test client code
> instead (I wrote some C++ stress test code). My Cassandra cluster is
> deployed on Amazon EC2, using the provided Datastax Community AMI( HVM
> instances ) in the Datastax document, and I am not using EBS, just using
> the ephemeral storage by default. The EC2 type of Cassandra servers are
> m3.xlarge. I use another EC2 instance for my stress test client, which is
> of type r3.8xlarge. Both the Cassandra server nodes and stress test client
> node are in us-east. I test the Cassandra cluster which is made up of 1
> node, 2 nodes, and 4 nodes separately. I just do INSERT test and SELECT
> test separately, but the performance doesn’t get linear increment when new
> nodes are added. Also I get some weird results. My test results are as
> follows(*I do 1 million operations and I try to get the best QPS when the
> max latency is no more than 200ms, and the latencies are measured from the
> client side. The QPS is calculated by total_operations/total_time).*
>
>
>
> *INSERT(write):*
>
> Nodes  RF  QPS    Avg(ms)  Min(ms)  .95(ms)  .99(ms)  .999(ms)  Max(ms)
> 1      1   18687  2.08     1.48     2.95     5.74     52.8      205.4
> 2      1   20793  3.15     0.84     7.71     41.35    88.7      232.7
> 2      2   22498  3.37     0.86     6.04     36.1     221.5     649.3
> 4      1   28348  4.38     0.85     8.19     64.51    169.4     251.9
> 4      3   28631  5.22     0.87     18.68    68.35    167.2     288
>
> *SELECT(read):*
>
> Nodes  RF  QPS    Avg(ms)  Min(ms)  .95(ms)  .99(ms)  .999(ms)  Max(ms)
> 1      1   24498  4.01     1.51     7.6      12.51    31.5      129.6
> 2      1   28219  3.38     0.85     9.5      17.71    39.2      152.2
> 2      2   35383  4.06     0.87     9.71     21.25    70.3      215.9
> 4      1   34648  2.78     0.86     6.07     14.94    30.8      134.6
> 4      3   52932  3.45     0.86     10.81    21.05    37.4      189.1
>
>
>
> The test data I use is generated randomly, and the schema I use is like (I
> use the cqlsh to create the columnfamily/table):
>
> CREATE TABLE table(
>
> id1  varchar,
>
> ts   varchar,
>
> id2  varchar,
>
> msg  varchar,
>
> PRIMARY KEY(id1, ts, id2));
>
> So the fields are all string and I generate each character of the string
> randomly, using srand(time(0)) and rand() in C++, so I think my test data
> could be uniformly distributed into the Cassandra cluster. And, in my
> client stress test code, I use thrift C++ interface, and the basic
> operation I do is like:
>
> thrift_client.execute_cql3_query(“INSERT INTO table WHE

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jack Krupansky
It would be helpful to look at some specific examples of sequences, showing how 
they grow. I suspect that the term “sequence” is being overloaded in some 
subtly misleading way here.

Besides, we’ve already answered the headline question – data locality is 
achieved by having a common partition key. So, we need some clarity as to what 
question we are really focusing on.

And, of course, we should be asking the “Cassandra Data Modeling 101” question 
of what do your queries want to look like, how exactly do you want to access 
your data. Only after we have a handle on how you need to read your data can we 
decide how it should be stored.

My immediate question to get things back on track: When you say “The typical 
read is to load a subset of sequences with the same seq_id”, what type of 
“subset” are you talking about? Again, a few explicit and concise example 
queries (in some concise, easy-to-read pseudo language or even plain English, 
but not belabored with full CQL syntax) would be very helpful. I mean, 
Cassandra has no “subset” concept, nor a “load subset” command, so what are we 
really talking about?
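
For instance, is it something along these lines (table and column names here 
are purely illustrative, not a proposal):

  SELECT seq_type, data FROM sequences WHERE seq_id = 'abc' AND seq_type IN ('t1', 't7', 't42');

or is the set of seq_types not even known at query time?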

Also, I presume we are talking CQL, but some of the references seem more 
Thrift/slice oriented.

-- Jack Krupansky

From: Eric Stevens 
Sent: Sunday, December 7, 2014 10:12 AM
To: user@cassandra.apache.org 
Subject: Re: How to model data to achieve specific data locality

> Also new seq_types can be added and old seq_types can be deleted. This means 
> I often need to ALTER TABLE to add and drop columns. 

Kai, unless I'm misunderstanding something, I don't see why you need to alter 
the table to add a new seq type.  From a data model perspective, these are just 
new values in a row.  

If you do have columns which are specific to particular seq_types, data 
modeling does become a little more challenging.  In that case you may get some 
advantage from using collections (especially map) to store data which applies 
to only a few seq types.  Or defining a schema which includes the set of all 
possible columns (that's when you're getting into ALTERs when a new column 
comes or goes).

> All sequences with the same seq_id tend to grow at the same rate.


Note that it is an anti pattern in Cassandra to append to the same row 
indefinitely.  I think you understand this because of your original question.  
But please note that a sub partitioning strategy which reuses subpartitions 
will result in degraded read performance after a while.  You'll need to rotate 
sub partitions by something that doesn't repeat in order to keep the data for a 
given partition key grouped into just a few sstables.  A typical pattern there 
is to use some kind of time bucket (hour, day, week, etc., depending on your 
write volume).


I do note that your original question was about preserving data locality - and 
having a consistent locality for a given seq_id - for best offline analytics.  
If you wanted to work for this, you can certainly also include a blob value in 
your partitioning key, whose value is calculated to force a ring collision with 
this record's sibling data.  With Cassandra's default partitioner of murmur3, 
that's probably pretty challenging - murmur3 isn't designed to be 
cryptographically strong (it doesn't work to make it difficult to force a 
collision), but it's meant to have good distribution (it may still be 
computationally expensive to force a collision - I'm not that familiar with its 
internal workings).  In this case, ByteOrderedPartitioner would be a lot easier 
to force a ring collision on, but then you need to work on a good ring 
balancing strategy to distribute your data evenly over the ring.

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan  wrote:

  "Those sequences are not fixed. All sequences with the same seq_id tend to 
grow at the same rate. If it's one partition per seq_id, the size will most 
likely exceed the threshold quickly" 


  --> Then use bucketing to avoid too wide partitions


  "Also new seq_types can be added and old seq_types can be deleted. This means 
I often need to ALTER TABLE to add and drop columns. I am not sure if this is a 
good practice from operation point of view."


  --> I don't understand why altering table is necessary to add seq_types. If 
"seq_types" is defined as your clustering column, you can have many of them 
using the same table structure ...









  On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang  wrote:

On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:

  It depends on the size of your data, but if your data is reasonably 
small, there should be no trouble including thousands of records on the same 
partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type) ought to 
work fine.  


  If the data size per partition exceeds some threshold that represents the 
right tradeoff of increasing repair cost, gc pressure, threatening unbalanced 
loads, and other issues that come with wide partitions, then you can 
subpartition via s

Re: How to model data to achieve specific data locality

2014-12-07 Thread Eric Stevens
> Also new seq_types can be added and old seq_types can be deleted. This
means I often need to ALTER TABLE to add and drop columns.

Kai, unless I'm misunderstanding something, I don't see why you need to
alter the table to add a new seq type.  From a data model perspective,
these are just new values in a row.

If you do have columns which are specific to particular seq_types, data
modeling does become a little more challenging.  In that case you may get
some advantage from using collections (especially map) to store data which
applies to only a few seq types.  Or defining a schema which includes the
set of all possible columns (that's when you're getting into ALTERs when a
new column comes or goes).

> All sequences with the same seq_id tend to grow at the same rate.

Note that it is an anti pattern in Cassandra to append to the same row
indefinitely.  I think you understand this because of your original
question.  But please note that a sub partitioning strategy which reuses
subpartitions will result in degraded read performance after a while.
You'll need to rotate sub partitions by something that doesn't repeat in
order to keep the data for a given partition key grouped into just a few
sstables.  A typical pattern there is to use some kind of time bucket
(hour, day, week, etc., depending on your write volume).
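
A minimal sketch of what that could look like, assuming a day-sized bucket
(table and column names are illustrative):

  CREATE TABLE sequence_data (
    seq_id   varchar,
    bucket   varchar,   -- e.g. '2014-12-07'; rotated so old partitions stop growing
    seq_type varchar,
    data     blob,
    PRIMARY KEY ((seq_id, bucket), seq_type)
  );

Reads for one seq_id then hit one partition per bucket, so keep the bucket
coarse enough that a typical read only spans a few of them.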

I do note that your original question was about preserving data locality -
and having a consistent locality for a given seq_id - for best offline
analytics.  If you wanted to work toward this, you can certainly also include
a blob value in your partitioning key, whose value is calculated to force a
ring collision with this record's sibling data.  With Cassandra's default
partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
designed to be cryptographically strong (it doesn't work to make it
difficult to force a collision), but it's meant to have good distribution
(it may still be computationally expensive to force a collision - I'm not
that familiar with its internal workings).  In this case,
ByteOrderedPartitioner would be a lot easier to force a ring collision on,
but then you need to work on a good ring balancing strategy to distribute
your data evenly over the ring.
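
If you want to sanity-check where keys are landing with the table sketched
above (names still illustrative), token() on the partition key columns shows
what murmur3 produces:

  SELECT seq_id, bucket, token(seq_id, bucket) FROM sequence_data LIMIT 20;

Comparing those tokens against the ranges each node owns (nodetool ring)
tells you which replicas hold which partitions.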

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan  wrote:

> "Those sequences are not fixed. All sequences with the same seq_id tend
> to grow at the same rate. If it's one partition per seq_id, the size will
> most likely exceed the threshold quickly"
>
> --> Then use bucketing to avoid too wide partitions
>
> "Also new seq_types can be added and old seq_types can be deleted. This
> means I often need to ALTER TABLE to add and drop columns. I am not sure if
> this is a good practice from operation point of view."
>
>  --> I don't understand why altering table is necessary to add seq_types.
> If "seq_types" is defined as your clustering column, you can have many of
> them using the same table structure ...
>
>
>
>
>
> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang  wrote:
>
>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:
>>
>>> It depends on the size of your data, but if your data is reasonably
>>> small, there should be no trouble including thousands of records on the
>>> same partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
>>> ought to work fine.
>>>
>>> If the data size per partition exceeds some threshold that represents
>>> the right tradeoff of increasing repair cost, gc pressure, threatening
>>> unbalanced loads, and other issues that come with wide partitions, then you
>>> can subpartition via some means in a manner consistent with your work load,
>>> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>
>>> For example, if seq_type can be processed for a given seq_id in any
>>> order, and you need to be able to locate specific records for a known
>>> seq_id/seq_type pair, you can compute the subpartition
>>> deterministically.  Or if you only ever need to read *all* values for a
>>> given seq_id, and the processing order is not important, just randomly
>>> generate a value for subpartition at write time, as long as you can know
>>> all possible values for subpartition.
>>>
>>> If the values for the seq_types for a given seq_id must always be
>>> processed in order based on seq_type, then your subpartition calculation
>>> would need to reflect that and place adjacent seq_types in the same
>>> partition.  As a contrived example, say seq_type was an incrementing
>>> integer, your subpartition could be seq_type / 100.
>>>
>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:
>>>
 I have a data model question. I am trying to figure out how to model
 the data to achieve the best data locality for analytic purpose. Our
 application processes sequences. Each sequence has a unique key in the
 format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
 number of seq_types. The typical read is to load a subset of sequences with
 

Re: Recommissioned node is much smaller

2014-12-07 Thread Laing, Michael
On a mac this works (different sed, use an actual newline):

"
nodetool info -T | grep ^Token | awk '{ print $3 }' | tr \\n , | sed -e
's/,$/\
>/'
"

Otherwise the last token will have an 'n' appended which you may not notice.
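
A more portable variant that sidesteps the sed difference entirely (an
untested sketch) would be:

"
nodetool info -T | grep ^Token | awk '{ print $3 }' | paste -s -d, -
"

paste -s joins the lines with commas and leaves no trailing delimiter.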

On Fri, Dec 5, 2014 at 4:34 PM, Robert Coli  wrote:

> On Wed, Dec 3, 2014 at 10:10 AM, Robert Wille  wrote:
>
>>  Load and ownership didn’t correlate nearly as well as I expected. I have
>> lots and lots of very small records. I would expect very high correlation.
>>
>>  I think the moral of the story is that I shouldn’t delete the system
>> directory. If I have issues with a node, I should recommission it properly.
>>
>
> If you always specify initial_token in cassandra.yaml, then you are
> protected from some cases similar to the one that you seem to have just
> encountered.
>
> Wish I had actually managed to post this on a blog, but :
>
>
> --- cut ---
>
> example of why :
>
> https://issues.apache.org/jira/browse/CASSANDRA-5571
>
> 11:22 < rcoli> but basically, explicit is better than implicit
> 11:22 < rcoli> the only reason ppl let cassandra pick tokens is that it's
> semi-complex to do "right" with vnodes
> 11:22 < rcoli> but once it has picked tokens
> 11:22 < rcoli> you know what they are
> 11:22 < rcoli> why have a risky conf file that relies on implicit state?
> 11:23 < rcoli> just put the tokens in the conf file. done.
> 11:23 < rcoli> then you can use auto_bootstrap:false even if you lose
> system keyspace, etc.
>
> I plan to write a short blog post about this, but...
>
> I recommend that anyone using Cassandra, vnodes or not, always explicitly
> populate their initial_token line in cassandra.yaml. There are a number of
> cases where you will lose if you do not do so, and AFAICT no cases where
> you lose by doing so.
>
> If one is using vnodes and wants to do this, the process goes like :
>
> 1) set num_tokens to the desired number of vnodes
> 2) start node/bootstrap
> 3) use a one liner like jeffj's :
>
> "
> nodetool info -T | grep ^Token | awk '{ print $3 }' | tr \\n , | sed -e
> 's/,$/\n/'
> "
>
> to get a comma delimited list of the vnode tokens
>
> 4) insert this comma delimited list in initial_token, and comment out
> num_tokens (though it is a NOOP)
>
>  --- cut ---
>
> =Rob
>
>
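
For reference, the resulting cassandra.yaml fragment ends up looking roughly
like this (the list itself comes straight out of the one-liner above):

  initial_token: <comma-delimited token list from nodetool>
  # num_tokens: 256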


Re: How to model data to achieve specific data locality

2014-12-07 Thread DuyHai Doan
"Those sequences are not fixed. All sequences with the same seq_id tend to
grow at the same rate. If it's one partition per seq_id, the size will most
likely exceed the threshold quickly"

--> Then use bucketing to avoid too wide partitions

"Also new seq_types can be added and old seq_types can be deleted. This
means I often need to ALTER TABLE to add and drop columns. I am not sure if
this is a good practice from operation point of view."

 --> I don't understand why altering table is necessary to add seq_types.
If "seq_types" is defined as your clustering column, you can have many of
them using the same table structure ...
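
For example (table and column names are illustrative), with seq_type as the
clustering column a brand-new type is just another row, no ALTER involved:

  CREATE TABLE sequences (
    seq_id   varchar,
    seq_type varchar,
    data     blob,
    PRIMARY KEY ((seq_id), seq_type)
  );

  INSERT INTO sequences (seq_id, seq_type, data) VALUES ('seq42', 'brand_new_type', 0x00);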





On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang  wrote:

> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:
>
>> It depends on the size of your data, but if your data is reasonably
>> small, there should be no trouble including thousands of records on the
>> same partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
>> ought to work fine.
>>
>> If the data size per partition exceeds some threshold that represents the
>> right tradeoff of increasing repair cost, gc pressure, threatening
>> unbalanced loads, and other issues that come with wide partitions, then you
>> can subpartition via some means in a manner consistent with your work load,
>> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>
>> For example, if seq_type can be processed for a given seq_id in any
>> order, and you need to be able to locate specific records for a known
>> seq_id/seq_type pair, you can compute the subpartition
>> deterministically.  Or if you only ever need to read *all* values for a
>> given seq_id, and the processing order is not important, just randomly
>> generate a value for subpartition at write time, as long as you can know
>> all possible values for subpartition.
>>
>> If the values for the seq_types for a given seq_id must always be
>> processed in order based on seq_type, then your subpartition calculation
>> would need to reflect that and place adjacent seq_types in the same
>> partition.  As a contrived example, say seq_type was an incrementing
>> integer, your subpartition could be seq_type / 100.
>>
>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:
>>
>>> I have a data model question. I am trying to figure out how to model the
>>> data to achieve the best data locality for analytic purpose. Our
>>> application processes sequences. Each sequence has a unique key in the
>>> format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
>>> number of seq_types. The typical read is to load a subset of sequences with
>>> the same seq_id. Naturally I would like to have all the sequences with the
>>> same seq_id to co-locate on the same node(s).
>>>
>>>
>>> However I can't simply create one partition per seq_id and use seq_id as
>>> my partition key. That's because:
>>>
>>>
>>> 1. there could be thousands or even more seq_types for each seq_id. It's
>>> not feasible to include all the seq_types into one table.
>>>
>>> 2. each seq_id might have different sets of seq_types.
>>>
>>> 3. each application only needs to access a subset of seq_types for a
>>> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
>>> prefer only touching the data that's needed.
>>>
>>>
>>> As per above, I think I should use one partition per
>>> [seq_id]_[seq_type]. But how can I archive the data locality on seq_id? One
>>> possible approach is to override IPartitioner so that I just use part of
>>> the field (say 64 bytes) to get the token (for location) while still using
>>> the whole field as partition key (for look up). But before heading that
>>> direction, I would like to see if there are better options out there. Maybe
>>> any new or upcoming features in C* 3.0?
>>>
>>>
>>> Thanks.
>>>
>>
> Thanks, Eric.
>
> Those sequences are not fixed. All sequences with the same seq_id tend to
> grow at the same rate. If it's one partition per seq_id, the size will most
> likely exceed the threshold quickly. Also new seq_types can be added and
> old seq_types can be deleted. This means I often need to ALTER TABLE to add
> and drop columns. I am not sure if this is a good practice from operation
> point of view.
>
> I thought about your subpartition idea. If there are only a few
> applications and each one of them uses a subset of seq_types, I can easily
> create one table per application since I can compute the subpartition
> deterministically as you said. But in my case data scientists need to
> easily write new applications using any combination of seq_types of a
> seq_id. So I want the data model to be flexible enough to support
> applications using any different set of seq_types without creating new
> tables, duplicate all the data etc.
>
> -Kai
>
>
>


Could ring cache really improve performance in Cassandra?

2014-12-07 Thread kong
Hi, 

I'm doing stress test on Cassandra. And I learn that using ring cache can
improve the performance because the client requests can directly go to the
target Cassandra server and the coordinator Cassandra node is the desired
target node. In this way, there is no need for coordinator node to route the
client requests to the target node, and maybe we can get the linear
performance increment.

 

However, in my stress test on an Amazon EC2 cluster, the test results are
weird. Seems that there's no performance improvement after using ring cache.
Could anyone help me explain this results? (Also, I think the results of
test without ring cache is weird, because there's no linear increment on QPS
when new nodes are added. I need help on explaining this, too). The results
are as follows:

 

INSERT(write):

Nodes  RF  QPS (no ring cache)  QPS (ring cache)
1      1   18687                20195
2      1   20793                26403
2      2   22498                21263
4      1   28348                30010
4      3   28631                24413

SELECT(read):

Nodes  RF  QPS (no ring cache)  QPS (ring cache)
1      1   24498                22802
2      1   28219                27030
2      2   35383                36674
4      1   34648                28347
4      3   52932                52590

 

 

Thank you very much,

Joy