Re: disk space and tombstones

2014-08-19 Thread Vitaly Chirkov


DuyHai Doan wrote:
 it looks like there is a need for a tool to take care of the bucketing
 switch

But I still can't understand why bucketing should be better than `DELETE row
USING TIMESTAMP`. It looks like the only source of truth on this topic is
the Cassandra source code.
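
(For illustration, the kind of statement I mean; the table name, column and
timestamp below are hypothetical:)

DELETE FROM events USING TIMESTAMP 1408406400000000
WHERE device_id = 'abc123';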





Re: cassandra-stress with clustering columns?

2014-08-19 Thread Mikhail Stepura
Are you interested in cassandra-stress in particular? Or in any tool 
which will allow you to stress test your schema?

I believe Apache JMeter + CQL plugin may be useful in the latter case.

https://github.com/Mishail/CqlJmeter

-M


On 8/17/14 12:26, Clint Kelly wrote:

Hi all,

Is there a way to use the cassandra-stress tool with clustering columns?

I am trying to figure out whether an application that I'm running is
slow because of my application logic, C* data model, or underlying
C* setup (e.g., I need more nodes or to tune some parameters).

My application uses tables with several clustering columns and a
couple of additional indices and it is running quite slowly under a
heavy write load.  I think that the problem is my data model (and
therefore table layout), but I'd like to confirm by replicating the
problem with cassandra-stress.

I don't see any option for using clustering columns or secondary
indices, but I wanted to check before diving into the code and trying
to add this functionality.

Thanks!

Best regards,
Clint





Re: Best way to format a ResultSet / Row ?

2014-08-19 Thread Fabrice Larcher
Hello,

I would try something like this (I have not tested it, no guarantees...):

import com.datastax.driver.core.ColumnDefinitions;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.utils.Bytes;

/* ... */

ResultSet result = null; // Put your instance HERE
final StringBuilder builder = new StringBuilder();
for (Row row : result) {
    builder.append("[ ");
    for (ColumnDefinitions.Definition def : row.getColumnDefinitions()) {
        // Raw bytes of the column value, rendered as a hex string
        String value = Bytes.toHexString(row.getBytesUnsafe(def.getName()));
        builder.append(def.getName()).append("=").append(value).append(" ");
    }
    builder.append("] ");
}
System.out.println(builder.toString());

/* ... */

But this is probably not very useful, since you only get hex dumps of the
raw bytes. You can then test the type of the column (variable 'def') in
order to call the best-suited getter of 'row', so that the variable 'value'
is more readable.


Fabrice LARCHER


2014-08-19 3:29 GMT+02:00 Kevin Burton bur...@spinn3r.com:

 The DataStax java driver has a Row object which has getInt, getLong methods…

 However, the getString only works on string columns.

 That's probably reasonable… but if I have a raw Row, how the heck do I
 easily print it?

 I need a handy way to dump a ResultSet …

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




[RELEASE CANDIDATE] Apache Cassandra 2.1.0-rc6 released

2014-08-19 Thread Sylvain Lebresne
The Cassandra team is pleased to announce the sixth release candidate for
the
future Apache Cassandra version 2.1.0.

Please note that this is not yet the final 2.1.0 release and, as such, it
should not be considered for production use. We'd appreciate your testing;
please let us know if you encounter any problems[3,4]. Please make sure to
have a look at the change log[1] and release notes[2].

Apache Cassandra 2.1.0-rc6[5] is available as usual from the cassandra
website (http://cassandra.apache.org/download/) and a debian package is
available using the 21x branch (see
http://wiki.apache.org/cassandra/DebianPackaging).

Enjoy!

[1]: http://goo.gl/MyqArD (CHANGES.txt)
[2]: http://goo.gl/7vS47U (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[4]: user@cassandra.apache.org
[5]:
http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-2.1.0-rc6


Re: Best way to format a ResultSet / Row ?

2014-08-19 Thread Sylvain Lebresne
This kind of question belongs on the java driver mailing list, not the
Cassandra one; please try to use the proper mailing list in the future.

On Tue, Aug 19, 2014 at 10:11 AM, Fabrice Larcher fabrice.larc...@level5.fr
 wrote:


 But this is probably not very useful, since you only get hex dumps of the
 raw bytes. You can then test the type of the column (variable 'def') in order
 to call the best-suited getter of 'row',


You don't have to test the type; you can just use the deserialize method of
the column type. So in Fabrice's example,
  Object val = def.getType().deserialize(row.getBytesUnsafe(def.getName()));
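
Putting the two together, a minimal formatter might look like this (an
untested sketch against the 2.0-era driver API, mirroring Fabrice's code):

import com.datastax.driver.core.ColumnDefinitions;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import java.nio.ByteBuffer;

public static String format(ResultSet result) {
    final StringBuilder builder = new StringBuilder();
    for (Row row : result) {
        builder.append("[ ");
        for (ColumnDefinitions.Definition def : row.getColumnDefinitions()) {
            ByteBuffer bytes = row.getBytesUnsafe(def.getName());
            // deserialize yields a readable Object (String, Long, UUID, ...)
            Object val = (bytes == null) ? null : def.getType().deserialize(bytes);
            builder.append(def.getName()).append("=").append(val).append(" ");
        }
        builder.append("] ");
    }
    return builder.toString();
}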

--
Sylvain


Options for expanding Cassandra cluster on AWS

2014-08-19 Thread Oleg Dulin

Distinguished Colleagues:

Our current Cassandra cluster on AWS looks like this:

3 nodes in N. Virginia, one per zone.
RF=3

Each node is a c3.4xlarge with 2x160G SSDs in RAID-0 (~300 GB of SSD on 
each node). It works great; I find it the optimal configuration for a 
Cassandra node.


But the time is coming soon when I need to expand storage capacity.

I have the following options in front of me:

1) Add 3 more c3.4xlarge nodes. This keeps the amount of data on each 
node reasonable, and all repairs and other tasks can complete in a 
reasonable amount of time. The downside is that c3.4xlarge are pricey.


2) Add provisioned EBS volumes. These days I can get SSD-backed EBS 
with up to 4000 IOPS provisioned. I can add those volumes to the 
data_file_directories list in cassandra.yaml, and I expect Cassandra can 
deal with that JBOD-style. The upside is that it is much cheaper than 
option #1 above; the downside is that it is a much slower configuration 
and repairs can take longer.
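
(For reference, a minimal sketch of that cassandra.yaml change; the mount
points below are hypothetical:)

data_file_directories:
    - /var/lib/cassandra/data
    - /mnt/ebs-piops-1/cassandra/data
    - /mnt/ebs-piops-2/cassandra/data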


I'd appreciate any input on this topic.

Thanks in advance,
Oleg




Re: Options for expanding Cassandra cluster on AWS

2014-08-19 Thread Brian Tarbox
The last guidance I heard from DataStax was to use m2.2xlarge's on AWS and
put data on the ephemeral drive... have they changed this guidance?

Brian


On Tue, Aug 19, 2014 at 9:41 AM, Oleg Dulin oleg.du...@gmail.com wrote:

 Distinguished Colleagues:

 Our current Cassandra cluster on AWS looks like this:

 3 nodes in N. Virginia, one per zone.
 RF=3

 Each node is a c3.4xlarge with 2x160G SSDs in RAID-0 (~300 GB of SSD on each
 node). It works great; I find it the optimal configuration for a
 Cassandra node.

 But the time is coming soon when I need to expand storage capacity.

 I have the following options in front of me:

 1) Add 3 more c3.4xlarge nodes. This keeps the amount of data on each node
 reasonable, and all repairs and other tasks can complete in a reasonable
 amount of time. The downside is that c3.4xlarge are pricey.

 2) Add provisioned EBS volumes. These days I can get SSD-backed EBS with
 up to 4000 IOPS provisioned. I can add those volumes to the
 data_file_directories list in cassandra.yaml, and I expect Cassandra can
 deal with that JBOD-style. The upside is that it is much cheaper than option
 #1 above; the downside is that it is a much slower configuration and repairs
 can take longer.

 I'd appreciate any input on this topic.

 Thanks in advance,
 Oleg





-- 
http://about.me/BrianTarbox


Re: Options for expanding Cassandra cluster on AWS

2014-08-19 Thread Russell Bradberry
I’m not sure about DataStax’s official stance, but the SSD-backed instances 
(e.g. i2.2xl, c3.4xl, etc.) greatly outperform the m2.2xl. Also, since 
DataStax is pro-SSD, I doubt they would still recommend staying on magnetic 
disks.

That said, I have benchmarked all the way up to the c3.8xl instances.  The most 
IOPS I could get out of each node was around 4000-5000.  This seemed to be 
because the context switching was preventing Cassandra from stressing the SSD 
drives to their maximum of 40,000 IOPS.

Since the SSD-backed EBS volumes offer up to 4000 IOPS, the speed of the disk 
would not be an issue.  You would, however, still be sharing network resources, 
so without a proper benchmark you would still be rolling the dice.

The best bang for the buck I’ve seen is the i2 instances.  They offer more 
ephemeral disk space at less cost than the c3, albeit with less CPU. We 
currently use the i2.xlarge and they are working out great.



On August 19, 2014 at 10:09:26 AM, Brian Tarbox (briantar...@gmail.com) wrote:

The last guidance I heard from DataStax was to use m2.2xlarge's on AWS and put 
data on the ephemeral drive... have they changed this guidance?

Brian


On Tue, Aug 19, 2014 at 9:41 AM, Oleg Dulin oleg.du...@gmail.com wrote:
Distinguished Colleagues:

Our current Cassandra cluster on AWS looks like this:

3 nodes in N. Virginia, one per zone.
RF=3

Each node is a c3.4xlarge with 2x160G SSDs in RAID-0 (~300 GB of SSD on each 
node). It works great; I find it the optimal configuration for a Cassandra 
node.

But the time is coming soon when I need to expand storage capacity.

I have the following options in front of me:

1) Add 3 more c3.4xlarge nodes. This keeps the amount of data on each node 
reasonable, and all repairs and other tasks can complete in a reasonable amount 
of time. The downside is that c3.4xlarge are pricey.

2) Add provisioned EBS volumes. These days I can get SSD-backed EBS with up to 
4000 IOPS provisioned. I can add those volumes to the data_file_directories 
list in cassandra.yaml, and I expect Cassandra can deal with that JBOD-style. 
The upside is that it is much cheaper than option #1 above; the downside is 
that it is a much slower configuration and repairs can take longer.

I'd appreciate any input on this topic.

Thanks in advance,
Oleg





--
http://about.me/BrianTarbox

Re: cassandra-stress with clustering columns?

2014-08-19 Thread Clint Kelly
Hi Mikhail,

This plugin looks great!  I have actually been using JMeter + a custom
REST endpoint driving Cassandra.  It would be great to compare the
results I got from that against the pure JMeter + Cassandra (to
evaluate the REST endpoint's performance).

Thanks!  I'll check this out.

Best regards,
Clint


On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura
mikhail.step...@outlook.com wrote:
 Are you interested in cassandra-stress in particular? Or in any tool which
 will allow you to stress test your schema?
 I believe Apache Jmeter + CQL plugin may be useful in the latter case.

 https://github.com/Mishail/CqlJmeter

 -M



 On 8/17/14 12:26, Clint Kelly wrote:

 Hi all,

 Is there a way to use the cassandra-stress tool with clustering columns?

 I am trying to figure out whether an application that I'm running is
 slow because of my application logic, C* data model, or underlying
 C* setup (e.g., I need more nodes or to tune some parameters).

 My application uses tables with several clustering columns and a
 couple of additional indices and it is running quite slowly under a
 heavy write load.  I think that the problem is my data model (and
 therefore table layout), but I'd like to confirm by replicating the
 problem with cassandra-stress.

 I don't see any option for using clustering columns or secondary
 indices, but I wanted to check before diving into the code and trying
 to add this functionality.

 Thanks!

 Best regards,
 Clint




Re: cassandra-stress with clustering columns?

2014-08-19 Thread Benedict Elliott Smith
The stress tool in 2.1 also now supports clustering columns:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema

There are, however, some features up for revision before release in order to
help generate realistic workloads. See
https://issues.apache.org/jira/browse/CASSANDRA-7519 for details.
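
(For the curious, a profile for the 2.1 tool is a YAML file along these
lines; the schema and distributions below are a hypothetical sketch, see the
blog post above for the authoritative format:)

keyspace: stresscql
table: eventlog
table_definition: |
  CREATE TABLE eventlog (
    device text,
    bucket text,
    ts timeuuid,
    value blob,
    PRIMARY KEY ((device, bucket), ts)
  );
columnspec:
  - name: device
    population: uniform(1..10000)   # distinct partition keys to generate
  - name: ts
    cluster: uniform(1..1000)       # rows per partition (clustering column)
insert:
  partitions: fixed(1)
queries:
  readrow:
    cql: SELECT * FROM eventlog WHERE device = ? AND bucket = ? LIMIT 10

It is then run with something like:
cassandra-stress user profile=stress.yaml ops(insert=1,readrow=1) n=1000000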


On Tue, Aug 19, 2014 at 10:46 PM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi Mikhail,

 This plugin looks great!  I have actually been using JMeter + a custom
 REST endpoint driving Cassandra.  It would be great to compare the
 results I got from that against the pure JMeter + Cassandra (to
 evaluate the REST endpoint's performance).

 Thanks!  I'll check this out.

 Best regards,
 Clint


 On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura
 mikhail.step...@outlook.com wrote:
  Are you interested in cassandra-stress in particular? Or in any tool
 which
  will allow you to stress test your schema?
  I believe Apache Jmeter + CQL plugin may be useful in the latter case.
 
  https://github.com/Mishail/CqlJmeter
 
  -M
 
 
 
  On 8/17/14 12:26, Clint Kelly wrote:
 
  Hi all,
 
  Is there a way to use the cassandra-stress tool with clustering columns?
 
  I am trying to figure out whether an application that I'm running is
  slow because of my application logic, C* data model, or underlying
  C* setup (e.g., I need more nodes or to tune some parameters).
 
  My application uses tables with several clustering columns and a
  couple of additional indices and it is running quite slowly under a
  heavy write load.  I think that the problem is my data model (and
  therefore table layout), but I'd like to confirm by replicating the
  problem with cassandra-stress.
 
  I don't see any option for using clustering columns or secondary
  indices, but I wanted to check before diving into the code and trying
  to add this functionality.
 
  Thanks!
 
  Best regards,
  Clint
 
 



EC2 SSD cluster costs

2014-08-19 Thread Jeremy Jongsma
The latest consensus around the web for running Cassandra on EC2 seems to
be "use new SSD instances." I've not seen any mention of the elephant in
the room - using the new SSD instances significantly raises the cluster
cost per TB. With Cassandra's strength being linear scalability to many
terabytes of data, it strikes me as odd that everyone is recommending such
a large storage cost hike almost without reservation.

Monthly cost comparison for a 100TB cluster (non-reserved instances):

m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
i2.xlarge (1x800 SSD): $76,000 (125 nodes)

Best case, the cost goes up 150%. How are others approaching these new
instances? Have you migrated and eaten the costs, or are you staying on
previous generation until prices come down?


Re: EC2 SSD cluster costs

2014-08-19 Thread Russell Bradberry
Short answer, it depends on your use-case.

We migrated to i2.xlarge nodes and saw an immediate increase in performance.  
If you just need plain ole raw disk space and don’t have a performance 
requirement to meet then the m1 machines would work, or hell even SSD EBS 
volumes may work for you.  The problem we were having is that we couldn’t fill 
the m1 machines because we needed to add more nodes for performance.  Now we 
have much more power and just the right amount of disk space.

Basically saying, these are not apples-to-apples comparisons



On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com) wrote:

The latest consensus around the web for running Cassandra on EC2 seems to be 
"use new SSD instances." I've not seen any mention of the elephant in the room 
- using the new SSD instances significantly raises the cluster cost per TB. 
With Cassandra's strength being linear scalability to many terabytes of data, 
it strikes me as odd that everyone is recommending such a large storage cost 
hike almost without reservation.

Monthly cost comparison for a 100TB cluster (non-reserved instances):

m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
i2.xlarge (1x800 SSD): $76,000 (125 nodes)

Best case, the cost goes up 150%. How are others approaching these new 
instances? Have you migrated and eaten the costs, or are you staying on 
previous generation until prices come down?

Re: EC2 SSD cluster costs

2014-08-19 Thread Kevin Burton
You're pricing it out at $ per GB… that's not the way to look at it.

Price it out at $ per IO… Once you price it that way, SSD makes a LOT more
sense.

Of course, it depends on your workload.  If you're just doing writes, and
they're all sequential, then cost per IO might not make a lot of sense.

We're VERY IO bound… so for us SSD is a no brainer.

We were actually all memory before because of this and just finished a big
SSD migration … (though on MySQL)…

But our Cassandra deploy will be on SSD on Softlayer.

It's a no brainer really..

Kevin


On Tue, Aug 19, 2014 at 8:56 AM, Jeremy Jongsma jer...@barchart.com wrote:

 The latest consensus around the web for running Cassandra on EC2 seems to
 be "use new SSD instances." I've not seen any mention of the elephant in
 the room - using the new SSD instances significantly raises the cluster
 cost per TB. With Cassandra's strength being linear scalability to many
 terabytes of data, it strikes me as odd that everyone is recommending such
 a large storage cost hike almost without reservation.

 Monthly cost comparison for a 100TB cluster (non-reserved instances):

 m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
 m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
 i2.xlarge (1x800 SSD): $76,000 (125 nodes)

 Best case, the cost goes up 150%. How are others approaching these new
 instances? Have you migrated and eaten the costs, or are you staying on
 previous generation until prices come down?




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com


Re: cassandra-stress with clustering columns?

2014-08-19 Thread Clint Kelly
Thanks for the update, Benedict.  We are still using 2.0.9
unfortunately.  :/   I will keep that in mind for when we upgrade.

On Tue, Aug 19, 2014 at 10:51 AM, Benedict Elliott Smith
belliottsm...@datastax.com wrote:
 The stress tool in 2.1 also now supports clustering columns:
 http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema

 There are however some features up for revision before release in order to
 help generate realistic workloads. See
 https://issues.apache.org/jira/browse/CASSANDRA-7519 for details.


 On Tue, Aug 19, 2014 at 10:46 PM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi Mikhail,

 This plugin looks great!  I have actually been using JMeter + a custom
 REST endpoint driving Cassandra.  It would be great to compare the
 results I got from that against the pure JMeter + Cassandra (to
 evaluate the REST endpoint's performance).

 Thanks!  I'll check this out.

 Best regards,
 Clint


 On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura
 mikhail.step...@outlook.com wrote:
  Are you interested in cassandra-stress in particular? Or in any tool
  which
  will allow you to stress test your schema?
  I believe Apache Jmeter + CQL plugin may be useful in the latter case.
 
  https://github.com/Mishail/CqlJmeter
 
  -M
 
 
 
  On 8/17/14 12:26, Clint Kelly wrote:
 
  Hi all,
 
  Is there a way to use the cassandra-stress tool with clustering
  columns?
 
  I am trying to figure out whether an application that I'm running is
  slow because of my application logic, C* data model, or underlying
  C* setup (e.g., I need more nodes or to tune some parameters).
 
  My application uses tables with several clustering columns and a
  couple of additional indices and it is running quite slowly under a
  heavy write load.  I think that the problem is my data model (and
  therefore table layout), but I'd like to confirm by replicating the
  problem with cassandra-stress.
 
  I don't see any option for using clustering columns or secondary
  indices, but I wanted to check before diving into the code and trying
  to add this functionality.
 
  Thanks!
 
  Best regards,
  Clint
 
 




Re: Best way to format a ResultSet / Row ?

2014-08-19 Thread Kevin Burton
I agree that it belongs on that mailing list, but it's set up weird... I
can't subscribe to it in Google Groups, and I am not sure what exactly is
wrong with it. I mailed the admins but it hasn't been resolved.


On Tue, Aug 19, 2014 at 1:49 AM, Sylvain Lebresne sylv...@datastax.com
wrote:

 This kind of question belongs on the java driver mailing list, not the
 Cassandra one; please try to use the proper mailing list in the future.

 On Tue, Aug 19, 2014 at 10:11 AM, Fabrice Larcher 
 fabrice.larc...@level5.fr wrote:


 But this is probably not very useful, since you only get hex dumps of the
 raw bytes. You can then test the type of the column (variable 'def') in order
 to call the best-suited getter of 'row',


 You don't have to test the type; you can just use the deserialize method of
 the column type. So in Fabrice's example,
   Object val =
 def.getType().deserialize(row.getBytesUnsafe(def.getName()));

 --
 Sylvain




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com


Re: EC2 SSD cluster costs

2014-08-19 Thread Shane Hansen
Again, it depends on your use case.
But we wanted to keep the data per node below 500 GB,
and we found RAIDed SSDs to be the best bang for the buck
for our cluster. I think we moved from the i2 to c3 because
our bottleneck tended to be CPU utilization (from parsing requests).



(Disclaimer: we're not Cassandra veterans, but we're not part of the RF=N=3
club)



On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com
wrote:

 Short answer, it depends on your use-case.

 We migrated to i2.xlarge nodes and saw an immediate increase in
 performance.  If you just need plain ole raw disk space and don’t have a
 performance requirement to meet then the m1 machines would work, or hell
 even SSD EBS volumes may work for you.  The problem we were having is that
 we couldn’t fill the m1 machines because we needed to add more nodes for
 performance.  Now we have much more power and just the right amount of disk
 space.

 Basically saying, these are not apples-to-apples comparisons



 On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com)
 wrote:

 The latest consensus around the web for running Cassandra on EC2 seems to
 be "use new SSD instances." I've not seen any mention of the elephant in
 the room - using the new SSD instances significantly raises the cluster
 cost per TB. With Cassandra's strength being linear scalability to many
 terabytes of data, it strikes me as odd that everyone is recommending such
 a large storage cost hike almost without reservation.

 Monthly cost comparison for a 100TB cluster (non-reserved instances):

 m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
 m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
 i2.xlarge (1x800 SSD): $76,000 (125 nodes)

 Best case, the cost goes up 150%. How are others approaching these new
 instances? Have you migrated and eaten the costs, or are you staying on
 previous generation until prices come down?




Manually deleting sstables

2014-08-19 Thread Parag Patel
After we dropped a table, we noticed that the sstables are still there.  After 
searching through the forum history, I noticed that this is known behavior.


1)  Is there any negative impact of deleting the sstables off disk and then 
restarting Cassandra?

2)  Are there any other recommended procedures for this?

Thanks,
Parag


Re: cassandra-stress with clustering columns?

2014-08-19 Thread Benedict Elliott Smith
The stress tool will work against any version of Cassandra; it's only
released alongside Cassandra for ease of deployment. You can safely use the
tool from pre-release versions.


On Tue, Aug 19, 2014 at 11:03 PM, Clint Kelly clint.ke...@gmail.com wrote:

 Thanks for the update, Benedict.  We are still using 2.0.9
 unfortunately.  :/   I will keep that in mind for when we upgrade.

 On Tue, Aug 19, 2014 at 10:51 AM, Benedict Elliott Smith
 belliottsm...@datastax.com wrote:
  The stress tool in 2.1 also now supports clustering columns:
 
 http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
 
  There are however some features up for revision before release in order
 to
  help generate realistic workloads. See
  https://issues.apache.org/jira/browse/CASSANDRA-7519 for details.
 
 
  On Tue, Aug 19, 2014 at 10:46 PM, Clint Kelly clint.ke...@gmail.com
 wrote:
 
  Hi Mikhail,
 
  This plugin looks great!  I have actually been using JMeter + a custom
  REST endpoint driving Cassandra.  It would be great to compare the
  results I got from that against the pure JMeter + Cassandra (to
  evaluate the REST endpoint's performance).
 
  Thanks!  I'll check this out.
 
  Best regards,
  Clint
 
 
  On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura
  mikhail.step...@outlook.com wrote:
   Are you interested in cassandra-stress in particular? Or in any tool
   which
   will allow you to stress test your schema?
   I believe Apache Jmeter + CQL plugin may be useful in the latter case.
  
   https://github.com/Mishail/CqlJmeter
  
   -M
  
  
  
   On 8/17/14 12:26, Clint Kelly wrote:
  
   Hi all,
  
   Is there a way to use the cassandra-stress tool with clustering
   columns?
  
   I am trying to figure out whether an application that I'm running is
   slow because of my application logic, C* data model, or underlying
   C* setup (e.g., I need more nodes or to tune some parameters).
  
   My application uses tables with several clustering columns and a
   couple of additional indices and it is running quite slowly under a
   heavy write load.  I think that the problem is my data model (and
   therefore table layout), but I'd like to confirm by replicating the
   problem with cassandra-stress.
  
   I don't see any option for using clustering columns or secondary
   indices, but I wanted to check before diving into the code and trying
   to add this functionality.
  
   Thanks!
  
   Best regards,
   Clint
  
  
 
 



Re: [RELEASE CANDIDATE] Apache Cassandra 2.1.0-rc6 released

2014-08-19 Thread Tony Anecito
That is great news, keep up the great work!

Best Regards,
Tony Anecito
Founder/President
MyUniPortal LLC
http://www.myuniportal.com



On Tuesday, August 19, 2014 2:38 AM, Sylvain Lebresne sylv...@datastax.com 
wrote:
 


The Cassandra team is pleased to announce the sixth release candidate for the
future Apache Cassandra version 2.1.0.

Please note that this is not yet the final 2.1.0 release and, as such, it should
not be considered for production use. We'd appreciate your testing; please let
us know if you encounter any problems[3,4]. Please make sure to have a look at
the change log[1] and release notes[2].

Apache Cassandra 2.1.0-rc6[5] is available as usual from the cassandra
website (http://cassandra.apache.org/download/) and a debian package is
available using the 21x branch (see 
http://wiki.apache.org/cassandra/DebianPackaging).

Enjoy!

[1]: http://goo.gl/MyqArD (CHANGES.txt)
[2]: http://goo.gl/7vS47U (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[4]: user@cassandra.apache.org
[5]: 
http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-2.1.0-rc6

Re: LOCAL_QUORUM without a replica in current data center

2014-08-19 Thread Viswanathan Ramachandran
Sorry for the spam - but I wanted to double check if anyone had experience
with such a scenario.

Thanks.



On Sun, Aug 17, 2014 at 7:11 PM, Viswanathan Ramachandran 
vish.ramachand...@gmail.com wrote:

 Hi,

 How does LOCAL_QUORUM read/write behave when the data center on which the
 query is executed does not have a replica of the keyspace?

 Does it result in an error, or can it be configured to do LOCAL_QUORUM on
 the nearest data center (as determined by the dynamic snitch) that has the
 replicas?

 We are essentially trying to design a Cassandra cluster with a keyspace
 only in certain regional-hub data centers, to keep the number of replicas
 under control.
 I am curious to know if a cassandra node not in the regional-hub data
 center can handle LOCAL_QUORUM type operations, or if clients really need
 to have a connection to the hub data center with the replica to use that
 consistency level.

 Thanks
 Vish








Re: Manually deleting sstables

2014-08-19 Thread Robert Coli
On Tue, Aug 19, 2014 at 8:59 AM, Parag Patel ppa...@clearpoolgroup.com
wrote:

  After we dropped a table, we noticed that the sstables are still there.
 After searching through the forum history, I noticed that this is known
 behavior.


Yes, it's providing protection in this case, though many people do not
expect this.

  1)  Is there any negative impact of deleting the sstables off disk
 and then restarting Cassandra?

You don't have to restart Cassandra, and no.

  2)  Are there any other recommended procedures for this?

0) stop writes to columnfamily
1) TRUNCATE columnfamily;
2) nodetool clearsnapshot # on the snapshot that results
3) DROP columnfamily;
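
(Concretely, with hypothetical keyspace/table names, that might look like:)

cqlsh> TRUNCATE mykeyspace.mytable;    -- auto_snapshot takes a safety snapshot
$ nodetool clearsnapshot mykeyspace    # frees the disk held by that snapshot
cqlsh> DROP TABLE mykeyspace.mytable;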

=Rob


Re: EC2 SSD cluster costs

2014-08-19 Thread Paulo Ricardo Motta Gomes
Still using good ol' m1.xlarge here + external caching (memcached). Trying
to adapt our use case to have different clusters for different use cases so
we can leverage SSD at an acceptable cost in some of them.


On Tue, Aug 19, 2014 at 1:05 PM, Shane Hansen shanemhan...@gmail.com
wrote:

 Again, it depends on your use case.
 But we wanted to keep the data per node below 500 GB,
 and we found RAIDed SSDs to be the best bang for the buck
 for our cluster. I think we moved from the i2 to c3 because
 our bottleneck tended to be CPU utilization (from parsing requests).



 (Disclaimer: we're not Cassandra veterans, but we're not part of the RF=N=3
 club)



 On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com
 wrote:

 Short answer, it depends on your use-case.

 We migrated to i2.xlarge nodes and saw an immediate increase in
 performance.  If you just need plain ole raw disk space and don’t have a
 performance requirement to meet then the m1 machines would work, or hell
 even SSD EBS volumes may work for you.  The problem we were having is that
 we couldn’t fill the m1 machines because we needed to add more nodes for
 performance.  Now we have much more power and just the right amount of disk
 space.

 Basically saying, these are not apples-to-apples comparisons



 On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com)
 wrote:

 The latest consensus around the web for running Cassandra on EC2 seems to
 be "use new SSD instances." I've not seen any mention of the elephant in
 the room - using the new SSD instances significantly raises the cluster
 cost per TB. With Cassandra's strength being linear scalability to many
 terabytes of data, it strikes me as odd that everyone is recommending such
 a large storage cost hike almost without reservation.

 Monthly cost comparison for a 100TB cluster (non-reserved instances):

 m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
 m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
 i2.xlarge (1x800 SSD): $76,000 (125 nodes)

 Best case, the cost goes up 150%. How are others approaching these new
 instances? Have you migrated and eaten the costs, or are you staying on
 previous generation until prices come down?





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: EC2 SSD cluster costs

2014-08-19 Thread Aiman Parvaiz
I completely agree with others here: it depends on your use case. We were
using hi1.4xlarge boxes and paying a huge amount to Amazon. Lately our
requirements changed; we are not hammering C* as much, and our data size
has gone down too, so given the new conditions we reserved and migrated to
c3.4xlarges to save quite a lot of money.


On Aug 19, 2014, at 10:25 AM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

Still using good ol' m1.xlarge here + external caching (memcached). Trying
to adapt our use case to have different clusters for different use cases so
we can leverage SSD at an acceptable cost in some of them.


On Tue, Aug 19, 2014 at 1:05 PM, Shane Hansen shanemhan...@gmail.com
wrote:

 Again, it depends on your use case.
 But we wanted to keep the data per node below 500 GB,
 and we found RAIDed SSDs to be the best bang for the buck
 for our cluster. I think we moved from the i2 to c3 because
 our bottleneck tended to be CPU utilization (from parsing requests).



 (Disclaimer: we're not Cassandra veterans, but we're not part of the RF=N=3
 club)



 On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com
 wrote:

 Short answer, it depends on your use-case.

 We migrated to i2.xlarge nodes and saw an immediate increase in
 performance.  If you just need plain ole raw disk space and don’t have a
 performance requirement to meet then the m1 machines would work, or hell
 even SSD EBS volumes may work for you.  The problem we were having is that
 we couldn’t fill the m1 machines because we needed to add more nodes for
 performance.  Now we have much more power and just the right amount of disk
 space.

 Basically saying, these are not apples-to-apples comparisons



 On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com)
 wrote:

 The latest consensus around the web for running Cassandra on EC2 seems to
 be "use new SSD instances." I've not seen any mention of the elephant in
 the room - using the new SSD instances significantly raises the cluster
 cost per TB. With Cassandra's strength being linear scalability to many
 terabytes of data, it strikes me as odd that everyone is recommending such
 a large storage cost hike almost without reservation.

 Monthly cost comparison for a 100TB cluster (non-reserved instances):

 m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
 m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
 i2.xlarge (1x800 SSD): $76,000 (125 nodes)

 Best case, the cost goes up 150%. How are others approaching these new
 instances? Have you migrated and eaten the costs, or are you staying on
 previous generation until prices come down?





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Cassandra Wiki Immutable?

2014-08-19 Thread Dave Brosius

added, thanks.

On 08/18/2014 06:15 AM, Otis Gospodnetic wrote:

Hi,

What is the state of Cassandra Wiki -- http://wiki.apache.org/cassandra ?

I tried to update a few pages, but it looks like pages are immutable. 
 Do I need to have my Wiki username (OtisGospodnetic) added to some ACL?


Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




Cassandra Consistency Level

2014-08-19 Thread Check Peck
We have a Cassandra cluster in three different datacenters (DC1, DC2 and DC3),
with 10 machines in each datacenter. We have a few tables in Cassandra
in which we have fewer than 100 records.

What we are seeing: some tables are out of sync between machines in DC3 and
machines in DC1 or DC2 when we do select count(*) on them.

As an example, we ran select count(*) while connected to one Cassandra
machine in the dc3 datacenter and again on one Cassandra machine in the dc1
datacenter, and the results were different.

root@machineA:/home/david/apache-cassandra/bin# python cqlsh
dc3114.dc3.host.com
Connected to TestCluster at dc3114.dc3.host.com:9160.
[cqlsh 2.3.0 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol
19.36.0]
Use HELP for help.
cqlsh> use testingkeyspace ;
cqlsh:testingkeyspace> select count(*) from test_metadata ;

count
---
12

cqlsh:testingkeyspace> exit
root@machineA:/home/david/apache-cassandra/bin# python cqlsh
dc18b0c.dc1.host.com
Connected to TestCluster at dc18b0c.dc1.host.com:9160.
[cqlsh 2.3.0 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol
19.36.0]
Use HELP for help.
cqlsh> use testingkeyspace ;
cqlsh:testingkeyspace> select count(*) from test_metadata ;

count
---
16

What could be the reason for this sync issue? Can anyone shed some light
on this?

Our java driver code and datastax c++ driver code are both using these
tables with CONSISTENCY LEVEL ONE.


Re: Cassandra Consistency Level

2014-08-19 Thread Robert Coli
On Tue, Aug 19, 2014 at 4:14 PM, Check Peck comptechge...@gmail.com wrote:

 What could be the reason for this sync issue? Can anyone shed some light
 on this?

 Since our java driver code and datastax c++ driver code are using these
 tables with CONSISTENCY LEVEL ONE.


1) write with CL.ONE
2) get success response to client
3) replication times out to DC3 and is queued as a hint
4) SELECT COUNT(*) in DC3

You should be able to observe the storage and delivery of the hints in 3),
and 4) should eventually be correct as a result of their delivery.
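
(One way to watch 3) happen, assuming Cassandra 1.2's system.hints table and
standard nodetool output:)

cqlsh> SELECT count(*) FROM system.hints;   -- hints queued on this coordinator
$ nodetool tpstats | grep HintedHandoff     # hint delivery activity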

=Rob


updated num_tokens value while changing replication factor and getting a nodetool repair error

2014-08-19 Thread Bryan Holladay
I have 1 DC that was originally 3 nodes each set with a single token:
'-9223372036854775808', '-3074457345618258603', '3074457345618258602'

I added two more nodes and ran nodetool move and nodetool cleanup one
server at a time with these tokens: '-9223372036854775808',
'-5534023222112865485', '-1844674407370955162', '1844674407370955161',
'5534023222112865484'

Everything looked good so I changed the replication factor for my keyspace
from 1 to 2 and started running nodetool repair on each node. The first
node ran for a while then threw an error:

 Repair session 8d2a1190-25aa-11e4-8a15-ff681618d551 for range
(1844674407370955161,5534023222112865484] failed with error
org.apache.cassandra.exceptions.RepairException: [repair
#8d2a1190-25aa-11e4-8a15-ff681618d551 on PLAGIARISM/STATS,
(1844674407370955161,5534023222112865484]] Validation failed in
/###.###.###

Since this was a temporary Column Family, I just dropped that CF and tried
to run nodetool repair again. It ended up giving me the same error with a
different CF. I tried nodetool cleanup and then nodetool repair again and
it eventually crashed the node (Status: DN). When I restarted Cassandra, it
still had the initial_token value of -9223372036854775808, but the
default num_tokens value was 256. When I checked the status, it showed
that it had 256 tokens (while my other 4 nodes still had 1). Luckily, it
chose 256 tokens that were in the existing token range for that server, so
it had the same "owns" value.

My question is three fold:

1) Is it better to use 256 vnodes and move the other 4 servers to 256 tokens
as well, or is it possible (or better) to change the tokens back to a single
token for the first server? I did see warnings about not using shuffle (the
new method is to create a new DC and move over), but I don't have space to
do this (the current DC is 31TB). I'm fine with 256 tokens that are in the
original token range, so would it be OK to never run the shuffle and just
give each server 256 tokens? If I should change back to 1 token per server,
is it possible to do so without decommissioning, removing all existing data,
and restarting? With such a large dataset, I feel that nodetool repair works
better when there are more tokens (my theory is that it works in smaller
chunks); is this a good reason to use 256 instead of one?

2) How can I fix the Repair Exception above?

3) Nodetool repair takes forever to run (5+ days). Is this because I have 1
token per node, or is there a better way to run this? Should I set the
start and end keys?
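
(For reference, the start/end-key form I mean would presumably look like
this; I'm assuming nodetool repair accepts -st/-et in 2.0, and the tokens
below are hypothetical:)

$ nodetool repair -st -9223372036854775808 -et -5534023222112865485 mykeyspace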


I'm running Cassandra 2.0.2

Any help would be greatly appreciated.

Thanks,
Bryan


Re: Cassandra Consistency Level

2014-08-19 Thread Mark Reddy
Hi,

As you are writing at CL.ONE, and cqlsh by default reads at CL.ONE, there is
a probability that you are reading stale data, i.e. the node you have
contacted for the read may not have the most recent data. If you have a
higher consistency requirement, you should look at increasing your
consistency level; for a more detailed look at this, see:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html

If you want to continue using CL.ONE, you could look at increasing the
read_repair_chance for better consistency.
http://www.datastax.com/documentation/cassandra/2.0/cassandra/reference/referenceTableAttributes.html
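
(In the Java driver, raising the consistency level looks roughly like the
sketch below; the contact point and table are taken from your example, and I'm
assuming a driver version where setConsistencyLevel is available on
statements:)

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

Cluster cluster = Cluster.builder().addContactPoint("dc18b0c.dc1.host.com").build();
Session session = cluster.connect("testingkeyspace");

// Read at LOCAL_QUORUM instead of the default ONE, so a majority of the
// local DC's replicas must respond before the count is returned.
SimpleStatement stmt = new SimpleStatement("SELECT count(*) FROM test_metadata");
stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
ResultSet rs = session.execute(stmt);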

Just to verify that this is in fact a consistency issue, could you run a
nodetool repair on that table and run the same queries again?


Regards,
Mark


On 20 August 2014 00:14, Check Peck comptechge...@gmail.com wrote:

 We have a Cassandra cluster in three different datacenters (DC1, DC2 and
 DC3), with 10 machines in each datacenter. We have a few tables in
 Cassandra in which we have fewer than 100 records.

 What we are seeing: some tables are out of sync between machines in DC3
 and machines in DC1 or DC2 when we do select count(*) on them.

 As an example, we ran select count(*) while connected to one Cassandra
 machine in the dc3 datacenter and again on one Cassandra machine in the dc1
 datacenter, and the results were different.

 root@machineA:/home/david/apache-cassandra/bin# python cqlsh
 dc3114.dc3.host.com
 Connected to TestCluster at dc3114.dc3.host.com:9160.
 [cqlsh 2.3.0 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol
 19.36.0]
 Use HELP for help.
 cqlsh> use testingkeyspace ;
 cqlsh:testingkeyspace> select count(*) from test_metadata ;

 count
 ---
 12

 cqlsh:testingkeyspace> exit
 root@machineA:/home/david/apache-cassandra/bin# python cqlsh
 dc18b0c.dc1.host.com
 Connected to TestCluster at dc18b0c.dc1.host.com:9160.
 [cqlsh 2.3.0 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol
 19.36.0]
 Use HELP for help.
 cqlsh> use testingkeyspace ;
 cqlsh:testingkeyspace> select count(*) from test_metadata ;

 count
 ---
 16

 What could be the reason for this sync issue? Can anyone shed some light
 on this?

 Our java driver code and datastax c++ driver code are both using these
 tables with CONSISTENCY LEVEL ONE.