Re: Which of these VPS configurations would perform better for Cassandra ?

2013-08-04 Thread Ertio Lew
@David:
Like all other start-ups, we too cannot start with all dedicated servers
for Cassandra. So right now we have no better choice except for using a VPS
:), but we can definitely choose one from amongst a suitable set of VPS
configurations. Since we are just starting out, could we initiate our
cluster with 2 nodes (RF=2), (KVM, 2GB RAM, 2 cores, 30GB SSD)? Right now
we won't be having a very heavy load on Cassandra for the next few months,
until we grow our user base. So this choice is mainly based on pricing
vs configuration, as well as Digital Ocean's good reputation in the
community.
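One caveat worth noting with a 2-node, RF=2 cluster: QUORUM requires both replicas, so a single node outage blocks quorum reads and writes. A quick sketch of the standard quorum arithmetic (a generic illustration, not specific to any provider or version):

```python
# Illustration of Cassandra's quorum arithmetic (generic, not provider-specific).
def quorum(rf: int) -> int:
    """Replicas required for a QUORUM operation: floor(RF/2) + 1."""
    return rf // 2 + 1

for rf in (2, 3, 5):
    tolerated = rf - quorum(rf)  # replicas that may be down with QUORUM still succeeding
    print(f"RF={rf}: quorum={quorum(rf)}, failures tolerated at QUORUM: {tolerated}")
```

With RF=2 a quorum is 2, i.e. both nodes, which is why RF=3 is the usual minimum when quorum operations matter.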


On Sun, Aug 4, 2013 at 12:53 AM, David Schairer dschai...@humbaba.net wrote:

 I've run several lab configurations on linodes; I wouldn't run cassandra
 on any shared virtual platform for large-scale production, just because
 your IO performance is going to be really hard to predict.  Lots of people
 do, though -- depends on your cassandra loads and how consistent you need
 to have performance be, as well as how much of your working set will fit
 into memory.  Remember that linode significantly oversells their CPU as
 well.

 The release version of KVM, at least as of a few months ago, still doesn't
 support TRIM on SSD; that, plus the fact that you don't know how others
 will use SSDs or if their file systems will keep the SSDs healthy, means
 that SSD performance on KVM is going to be highly unpredictable.  I have
 not tested digitalocean, but I did test several other KVM+SSD shared-tenant
 hosting providers aggressively for cassandra a couple months ago; they all
 failed badly.

 Your mileage will vary considerably based on what you need out of
 cassandra, what your data patterns look like, and how you configure your
 system.  That said, I would use xen before KVM for high-performance IO.

 I have not run Cassandra in any volume on Amazon -- lots of folks have,
 and may have recommendations (including SSD) there for where it falls on
 the price/performance curve.

 --DRS

 On Aug 3, 2013, at 11:33 AM, Ertio Lew ertio...@gmail.com wrote:

  I am building a cluster (initially starting with a 2-3 node cluster). I
 have come across two seemingly good options for hosting, Linode & Digital
 Ocean. VPS configurations for both are listed below:
 
 
  Linode:-
  --
  XEN Virtualization
  2 GB RAM
  8 cores CPU (2x priority) (8 processor Xen instances)
  96 GB Storage
 
 
  Digital Ocean:-
  -
  KVM Virtualization
  2GB Memory
  2 Cores
  40GB SSD Disk
  Digital Ocean's VPS is at half the price of the above listed Linode VPS.
 
 
  Could you clarify which of these two VPS would be better as Cassandra
 nodes?
 
 




Re: Which of these VPS configurations would perform better for Cassandra ?

2013-08-04 Thread Radim Kolar
With 2 GB RAM, be prepared for crashes: it can hardly handle
peaks in memory consumption from compaction, validation, etc.
KVM works well only if you are using a recent version with virtio drivers
and the provider is not overselling memory. On shared hosting you will not
be able to handle IO load during peak times, and handling peak times is
the most important thing for any web site.
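Radim's 2 GB warning can be made concrete: Cassandra's startup script auto-sizes the JVM heap from system RAM. The sketch below approximates the heuristic used by cassandra-env.sh in the 1.2 era (treat the exact formula as an assumption and check your version's script):

```python
def default_max_heap_mb(system_ram_mb: int) -> int:
    # Approximation (assumption) of the auto-sizing heuristic in cassandra-env.sh
    # around the 1.2 era: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)).
    return max(min(system_ram_mb // 2, 1024),
               min(system_ram_mb // 4, 8192))

for ram_mb in (2048, 4096, 8192, 16384):
    print(f"{ram_mb} MB RAM -> default max heap ~{default_max_heap_mb(ram_mb)} MB")
```

On a 2 GB VPS the heap lands around 1 GB, leaving little headroom for compaction or validation spikes, which is the point above.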


Get 2 old computers in a second-hand shop, put 6 hard drives and 8
GB RAM in each, then send these servers to a hosting facility.


You are trying to save money in the wrong place.


Re: Which of these VPS configurations would perform better for Cassandra ?

2013-08-04 Thread Rajkumar Gupta
Okay, so what would be a workable VPS configuration to start with, and what is
the minimum number of nodes to start with (2 ok?)? Seriously, I cannot afford
the hassles of a colocation setup. My hosting provider offers SSD drives with
KVM virtualization.


Re: Which of these VPS configurations would perform better for Cassandra ?

2013-08-04 Thread David Schairer
Of course -- my point is simply that if you're looking for speed, SSD+KVM, 
especially in a shared-tenant situation, is unlikely to perform the way you 
want it to.  If you're building a pure proof of concept that never stresses the 
system, it doesn't matter, but if you plan an MVP with any sort of scale, 
you'll want to plan to be on something more robust.  

I'll also say that it's really important (imho) to be doing even your dev in a 
config where you have the same consistency conditions as your eventual 
production environment -- so make sure you're writing to both nodes and can 
have cases where eventual-consistency delays kick in, or it'll come back to 
bite you later -- I've seen this force people to redesign their whole data 
model when they don't plan for it initially.  
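The eventual-consistency point can be illustrated with a hypothetical toy model: two replicas with last-write-wins reconciliation, where a read hits a replica the write has not yet reached. This is a sketch of the failure mode to exercise in dev, not Cassandra's actual replication code:

```python
# Hypothetical toy model (not Cassandra code): two replicas with
# last-write-wins reconciliation, showing a stale read before repair runs.
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def repair(a, b):
    # Anti-entropy: each replica adopts the newer timestamped value per key.
    for key in set(a.data) | set(b.data):
        for src, dst in ((a, b), (b, a)):
            if key in src.data:
                dst.write(key, src.data[key][1], src.data[key][0])

r1, r2 = Replica(), Replica()
r1.write("user:1", "alice", ts=1)  # write reaches only one replica (think CL.ONE)
stale = r2.read("user:1")          # read from the other replica -> None (stale)
repair(r1, r2)                     # later, repair converges the replicas
fresh = r2.read("user:1")          # now "alice"
print(stale, fresh)
```

A data model that cannot tolerate the stale read in the middle is exactly the kind of design problem that is cheaper to find in dev than in production.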

As I said, I haven't tested DO.  I've tested very similar configurations at 
other providers and they were all terrible under load -- and certainly took 
away most of the benefits of SSD once you stressed writes a bit.  XEN+SSD, on 
modern kernels, should work better, but I didn't test it (linode doesn't offer 
this, though, and they've had lots of other challenges of late).  

--DRS




Re: Which of these VPS configurations would perform better for Cassandra ?

2013-08-04 Thread Radim Kolar
A workable configuration depends on your requirements. You need to develop 
your own testing procedure.


Things to consider:
- how much data you will have
- your 95th-percentile response time target
- size of rows
- number of columns per row
- data growth rate
- data rewrite rate
- whether TTL expiration is used

Never aim for the minimum. Cassandra's resource usage varies hugely during load spikes.
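For the 95th-percentile target above, here is a minimal nearest-rank percentile calculation you could use when analyzing your own test results (illustrative only; a real benchmark harness such as YCSB reports percentiles for you):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the ceil(pct/100 * N)-th smallest sample."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[rank - 1]

latencies_ms = [3, 4, 4, 5, 5, 6, 7, 9, 12, 80]  # made-up response times
print("p95:", percentile(latencies_ms, 95), "ms")  # the outlier dominates p95
```

Note how one slow outlier sets the p95 even when the median looks fine; that is why targeting percentiles rather than averages matters for load-spike behavior.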


Re: org.apache.cassandra.io.sstable.CorruptSSTableException

2013-08-04 Thread Keith Wright
Re-sending hoping to get some help.  Any ideas would be much appreciated!

From: Keith Wright kwri...@nanigans.com
Date: Friday, August 2, 2013 3:01 PM
To: user@cassandra.apache.org
Subject: org.apache.cassandra.io.sstable.CorruptSSTableException

Hi all,

   We just added a node to our cluster (1.2.4, vnodes) and it appears to be 
running well, except I see that the new node is not making any progress 
compacting one of the CFs.  The exception below is generated.  My assumption is 
that the only way to handle this is to stop the node, delete the file in 
question, restart, and run repair.

Thoughts?

org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
dataSize of 1249463589142530 starting at 5604968 would be larger than file 
/data/3/cassandra/data/users/global_user/users-global_user-ib-1550-Data.db 
length 14017479
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:168)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:83)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:69)
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:177)
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:152)
    at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:139)
    at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:36)
    at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Deserializer$1.runMayThrow(ParallelCompactionIterable.java:288)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: dataSize of 1249463589142530 starting at 
5604968 would be larger than file 
/data/3/cassandra/data/users/global_user/users-global_user-ib-1550-Data.db 
length 14017479
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:123)
    ... 9 more
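For context, the exception comes from a sanity check: a row's serialized length field, as read from the SSTable, claims more bytes than remain in the file, which indicates on-disk corruption. A hedged sketch of that style of check (my own illustration, not Cassandra's actual implementation):

```python
def check_data_size(data_size: int, start_offset: int, file_length: int) -> None:
    # Mirrors the invariant behind the exception (sketch, not Cassandra's code):
    # the serialized row's claimed size, starting at this offset, must fit in the file.
    if data_size < 0 or start_offset + data_size > file_length:
        raise IOError(f"dataSize of {data_size} starting at {start_offset} "
                      f"would be larger than file length {file_length}")

# The values from the stack trace fail the check, flagging corruption:
try:
    check_data_size(1249463589142530, 5604968, 14017479)
except IOError as exc:
    print("corrupt:", exc)
```

The absurd dataSize (about 1.2 PB in a 14 MB file) is why the stop/delete/repair approach is reasonable: the length field itself is garbage.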



Re: Which of these VPS configurations would perform better for Cassandra ?

2013-08-04 Thread Ben Bromhead
If you want to get a rough idea of how things will perform, fire up YCSB 
(https://github.com/brianfrankcooper/YCSB/wiki) and run the tests that most 
closely match what you think your workload will be (run the test clients from a 
couple of beefy AWS spot instances for less than a dollar). As you are a new 
startup without any existing load/traffic patterns, benchmarking will be your best bet.
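By default YCSB skews requests toward hot keys with a Zipfian distribution. If you want to reason about that skew outside YCSB, here is a toy Zipfian key chooser (the theta value and the sampling method are simplifying assumptions; YCSB's internal generator differs):

```python
import bisect
import itertools
import random

def zipf_sampler(n_keys: int, theta: float = 0.99, seed: int = 42):
    """Toy Zipfian key chooser in the spirit of YCSB's default request
    distribution (theta is an assumption, not YCSB's exact constant)."""
    weights = [1.0 / (rank ** theta) for rank in range(1, n_keys + 1)]
    cumulative = list(itertools.accumulate(weights))
    rng = random.Random(seed)
    while True:
        # Inverse-CDF sampling over the discrete weights.
        yield bisect.bisect_left(cumulative, rng.random() * cumulative[-1])

sampler = zipf_sampler(1000)
sample = [next(sampler) for _ in range(10_000)]
hot_share = sum(1 for key in sample if key < 10) / len(sample)
print(f"share of requests hitting the 10 hottest keys: {hot_share:.0%}")
```

Heavy skew means cache behavior, not raw disk speed, can dominate your results, so match the request distribution to your expected workload before trusting a benchmark number.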

Also have a look at running Cassandra with SmartOS on Joyent. When you run 
SmartOS on Joyent, virtualisation is done using Solaris zones, an OS-based 
virtualisation, which is at least a quadrillion times better than KVM, Xen, etc. 

Ok maybe not that much… but it is pretty cool and has the following benefits:

- No hardware emulation.
- Shared kernel with the host (you don't have to waste precious memory running 
a guest os).
- ZFS :)

Have a read of http://wiki.smartos.org/display/DOC/SmartOS+Virtualization for 
more info.

There are some downsides as well:

- The version of Cassandra that comes with the SmartOS package management 
system is old and busted, so you will want to build from source. 
- You will want to be technically confident in running on something a little 
outside the norm (SmartOS is based on Solaris).

Just make sure you test and benchmark all your options, a few days of testing 
now will save you weeks of pain.

Good luck!

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr 



Better to have lower or greater cardinality for partition key in CQL3?

2013-08-04 Thread David Ward
Hello,
 I was curious what people have found to be better for
structuring/modeling data in C*.   With my data I have two primary
keys: one 64-bit int that's 0-50 million (it's unlikely to ever go higher
than 70 million) and another 64-bit that's probably close to
hitting a trillion in the next year or so.   Looking at how the data
is going to behave, for the first few months each row/record will be
updated, but after that it's practically written in stone.  Still, I was
leaning toward leveled compaction as it gets updated anywhere from
once an hour to at least once a day for the first 7 days.

So, from anyone's experience, is it better to use a low-cardinality
partition key or a high-cardinality one?   Additionally, the data organized by
the low-cardinality set is probably 1-6 GB (and growing) but the high
cardinality would be 1-6 MB, only 2-3x a year.


Thanks,
   Dave


new high cardinality keys in 1 year: ~15,768,000,000
new low cardinality keys in 1 year: 10,000-30,000

low cardinality key set size: ~1-6 GB
high cardinality key set size: ~1-5 MB
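Given those numbers, the practical question is partition size: the low-cardinality choice concentrates gigabytes into a single partition, far beyond the commonly cited guideline of roughly 100 MB per partition, while the high-cardinality partitions stay small. A sketch (the 100 MB ceiling is a rule of thumb, not a hard limit):

```python
# Rule-of-thumb check (the 100 MB ceiling is a common guideline, not a hard limit).
MAX_RECOMMENDED_PARTITION_MB = 100

def partition_ok(partition_size_mb: float) -> bool:
    return partition_size_mb <= MAX_RECOMMENDED_PARTITION_MB

# Sizes taken from the post's estimates:
print("low-cardinality  (~6 GB/partition):", partition_ok(6 * 1024))  # False
print("high-cardinality (~5 MB/partition):", partition_ok(5))         # True
```

Oversized partitions hurt compaction, repair, and read latency, which argues for the high-cardinality key here even though it means far more partitions.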