Re: Exception with java driver

2014-06-19 Thread Sylvain Lebresne
Please don't post on two mailing lists at once; it makes it impossible for
people who are not subscribed to both mailing lists to follow the thread
(and is bad form in general). If unsure which one is the most appropriate,
it's fine, pick your best guess (in this case it's clearly a java driver
question).

--
Sylvain


On Thu, Jun 19, 2014 at 5:22 AM, Shaheen Afroz shaheenn.af...@gmail.com
wrote:

 +Cassandra DL

 We have Cassandra nodes in three datacenters - dc1, dc2 and dc3 - and the
 cluster name is DataCluster. Our application code is deployed in the same
 three datacenters, and it is accessing Cassandra.

 Now I want to make sure if application call is coming from `dc1` then it
 should go to cassandra `dc1` always. Same with `dc2` and `dc3`.

 So I decided to use the DCAwareRoundRobinPolicy of the DataStax Java driver.
 The Cassandra version we have is DSE 4.0, and the DataStax Java driver
 version we are using is 2.0.2.

 But somehow, with the code below, it always gives me an exception:

 com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
 tried for query failed (no host was tried)

 But in the same code, if I comment out the line below and run it again, it
 works fine without any problem. That is pretty strange. What could be wrong
 with DCAwareRoundRobinPolicy or the Cassandra setup?

 .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("dc1"))
  Below is my code -

 public static Cluster cluster;
 public static Session session;
 public static Builder builder;

 public static void main(String[] args) {
     try {
         builder = Cluster.builder();
         builder.addContactPoint("some1_dc1_machine");
         builder.addContactPoint("some2_dc1_machine");
         builder.addContactPoint("some1_dc2_machine");
         builder.addContactPoint("some2_dc2_machine");
         builder.addContactPoint("some1_dc3_machine");
         builder.addContactPoint("some2_dc3_machine");

         PoolingOptions opts = new PoolingOptions();
         opts.setCoreConnectionsPerHost(HostDistance.LOCAL,
                 opts.getCoreConnectionsPerHost(HostDistance.LOCAL));

         SocketOptions socketOpts = new SocketOptions();
         socketOpts.setReceiveBufferSize(1048576);
         socketOpts.setSendBufferSize(1048576);
         socketOpts.setTcpNoDelay(false);

         cluster = builder
                 .withSocketOptions(socketOpts)
                 .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                 .withPoolingOptions(opts)
                 .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
                 .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("dc1"))
                 .withCredentials("username", "password")
                 .build();

         session = cluster.connect("testingkeyspace");
         Metadata metadata = cluster.getMetadata();
         System.out.println(String.format("Connected to cluster '%s' on %s.",
                 metadata.getClusterName(), metadata.getAllHosts()));
     } catch (NoHostAvailableException e) {
         System.out.println("NoHostAvailableException");
         e.printStackTrace();
         System.out.println(e.getErrors());
     } catch (Exception e) {
         System.out.println("Exception");
         e.printStackTrace();
     }
 }
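A frequent cause of NoHostAvailableException when enabling DCAwareRoundRobinPolicy is a datacenter-name mismatch: the name passed to the policy must match, including case, the name the snitch actually reports (often something like "DC1" or "datacenter1" rather than "dc1"). One quick check is to read the datacenter names out of the output of `nodetool status`; a minimal sketch, with hypothetical sample output:

```python
import re

def datacenters(status_output):
    """Extract datacenter names from `nodetool status` output."""
    return re.findall(r"^Datacenter:\s*(\S+)", status_output, re.MULTILINE)

# Hypothetical, abridged `nodetool status` output:
sample = """Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
Datacenter: DC2
===============
"""
print(datacenters(sample))  # -> ['DC1', 'DC2']
```

If the names printed here differ from the string given to DCAwareRoundRobinPolicy, that is the likely culprit.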




Re: EBS SSD - Cassandra ?

2014-06-19 Thread Alain RODRIGUEZ
Ok, looks fair enough.

Thanks guys. It would be great to be able to add disks when the amount of
data rises, and add nodes when throughput increases... :)


2014-06-19 5:27 GMT+02:00 Ben Bromhead b...@instaclustr.com:


 http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html

 From the link:

 EBS volumes are not recommended for Cassandra data volumes for the
 following reasons:

 * EBS volumes contend directly for network throughput with standard
 packets. This means that EBS throughput is likely to fail if you saturate a
 network link.
  * EBS volumes have unreliable performance. I/O performance can be
 exceptionally slow, causing the system to back load reads and writes until
 the entire cluster becomes unresponsive.
  * Adding capacity by increasing the number of EBS volumes per host does
 not scale. You can easily surpass the ability of the system to keep
 effective buffer caches and concurrently serve requests for all of the data
 it is responsible for managing.

 Still applies, especially the network contention and latency issues.

 Ben Bromhead
 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | +61 415 936 359

 On 18 Jun 2014, at 7:18 pm, Daniel Chia danc...@coursera.org wrote:

 While they guarantee IOPS, they don't really make any guarantees about
 latency. Since EBS goes over the network, there's so many things in the
 path of getting at your data, I would be concerned with random latency
 spikes, unless proven otherwise.

 Thanks,
 Daniel


 On Wed, Jun 18, 2014 at 1:58 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 In this document it is said :


- Provisioned IOPS (SSD) - Volumes of this type are ideal for the
most demanding I/O intensive, transactional workloads and large relational
or NoSQL databases. This volume type provides the most consistent
performance and allows you to provision the exact level of performance you
need with the most predictable and consistent performance. With this type
of volume you provision exactly what you need, and pay for what you
provision. Once again, you can achieve up to 48,000 IOPS by connecting
multiple volumes together using RAID.



 2014-06-18 10:57 GMT+02:00 Alain RODRIGUEZ arodr...@gmail.com:

 Hi,

 I just saw this :
 http://aws.amazon.com/fr/blogs/aws/new-ssd-backed-elastic-block-storage/

 Since the problem with EBS was the network, there is no chance that this
 hardware architecture might be useful alongside Cassandra, right?

 Alain







Are writes to indexes performed asynchronously?

2014-06-19 Thread Tom van den Berge
Hi,

I have a column family with a secondary index on one of its columns. I
noticed that when I write a row to the column family, and immediately query
that row through the secondary index, every now and then it won't give any
results.

Could it be that Cassandra performs the write to the internal index column
family asynchronously? That might explain this behaviour.

In other words, when writing to an indexed column family, is there, or can
there be, any guarantee that the write to the index is completed when the
write to the original column family is completed?

I'm using a single-node cluster, with consistency level ONE.
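Whatever the root cause turns out to be, a common client-side mitigation for read-after-write visibility gaps like this is to retry the index query briefly with backoff. A generic sketch (the `query` callable is a stand-in for the actual driver call):

```python
import time

def read_with_retry(query, attempts=5, initial_delay=0.05):
    """Call `query` until it returns a non-empty result or attempts run out."""
    delay = initial_delay
    for _ in range(attempts):
        rows = query()
        if rows:
            return rows
        time.sleep(delay)
        delay *= 2  # exponential backoff between attempts
    return []

# Demo with a stub query that only succeeds on the third call:
calls = {"n": 0}
def stub_query():
    calls["n"] += 1
    return ["row"] if calls["n"] >= 3 else []

print(read_with_retry(stub_query, initial_delay=0.001))  # -> ['row']
```

This only papers over the latency; it does not answer whether the index write is synchronous.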

Thanks,
Tom


Re: EBS SSD - Cassandra ?

2014-06-19 Thread Benedict Elliott Smith
I would say this is worth benchmarking before jumping to conclusions. The
network being a bottleneck (or latency causing) for EBS is, to my
knowledge, supposition, and instances can be started with direct
connections to EBS if this is a concern. The blog post below shows that
even without SSDs the EBS-optimised provisioned-IOPS instances show pretty
consistent latency numbers, although those latencies are higher than you
would typically expect from locally attached storage.

http://blog.parse.com/2012/09/17/parse-databases-upgraded-to-amazon-provisioned-iops/

Note, I'm not endorsing the use of EBS. Cassandra is designed to scale up
with number of nodes, not with depth of nodes (as Ben mentions, saturating
a single node's data capacity is pretty easy these days. CPUs rapidly
become the bottleneck as you try to go deep). However the argument that EBS
cannot provide consistent performance seems overly pessimistic, and should
probably be empirically determined for your use case.


On Thu, Jun 19, 2014 at 9:50 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:




Best practices for repair

2014-06-19 Thread Paolo Crosato

Hi everybody,

we have some problems running repairs on a timely schedule. We have a 
three node deployment, and we start repair on one node every week, 
repairing one column family at a time.
However, when we run into the big column families, the repair sessions 
usually hang indefinitely, and we have to restart them manually.


The script runs commands like:

nodetool repair keyspace columnfamily

one by one.

This has not been a major issue for some time, since we never delete 
data, however we would like to sort the issue once and for all.


Reading resources on the net, I came to the conclusion that we could:

1) either run a repair session like the one above, but with the -pr 
switch, and run it on every node, not just on one
2) or run sub range repair as described here: 
http://www.datastax.com/dev/blog/advanced-repair-techniques , which 
would be the best option.
However, the latter procedure would require us to write some Java program 
that calls describe_splits to get the tokens to feed nodetool repair with.
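If evenly sized slices are acceptable, the splits can also be computed without describe_splits: assuming Murmur3Partitioner, the token space is simply [-2^63, 2^63 - 1], and equal slices can be fed to nodetool repair's -st/-et options. A rough sketch (keyspace and column family names, and the slice count, are placeholders):

```python
MIN_TOKEN = -(2 ** 63)      # Murmur3Partitioner token range lower bound
MAX_TOKEN = 2 ** 63 - 1     # ...and upper bound

def subranges(n):
    """Split the full Murmur3 token range into n contiguous (start, end] slices."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * step for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))

for start, end in subranges(4):
    # Placeholder keyspace/column family names:
    print("nodetool repair -st %d -et %d keyspace columnfamily" % (start, end))
```

Note this ignores the actual ring layout, so each slice may span several replicas' ranges; describe_splits (or the -pr switch) remains the more precise option.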


Is it true that the second procedure is available out of the box only in 
the commercial version of OpsCenter?


I would like to know if these are the current best practices for repairs, 
or if there is some other option that makes repair easier to perform and 
more reliable than it is now.

Regards,

Paolo Crosato

--
Paolo Crosato
Software engineer/Custom Solutions
e-mail: paolo.cros...@targaubiest.com



Metrics library for time series data and Cassandra

2014-06-19 Thread Kevin Burton
Hey guys.

If you haven't seen KairosDB, it's a time series database on top of
cassandra.

Anyway, we're deploying it in production.  However, the existing APIs are a
bit raw (requiring you to send JSON directly) and don't provide much
syntactic sugar on top.

There's the codahale metrics API which a lot of people are now using,
including Cassandra, and I've been really happy with it.

Except it doesn't support tags…

So I took it and extended it.

https://github.com/burtonator/metrics-kairosdb

Would love feedback and for you guys to just use it out of the box :)

I should probably put it in Maven central… but don't have time at the
moment. (if someone wants to help there that would rock).

I'm somewhat happy with it so far.

I'm probably going to refactor it to use more of a fluent API / builder
pattern.

Anyway… have at it!  Hopefully it helps someone!

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Batch of prepared statements exceeding specified threshold

2014-06-19 Thread Pavel Kogan
What a coincidence! Today happened in my cluster of 7 nodes as well.

Regards,
  Pavel


On Wed, Jun 18, 2014 at 11:13 AM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 I have a 10 node cluster with cassandra 2.0.8.

 I am getting these warnings in the log when I run my code. What my code
 does is just read data from a CF and, in some cases, write new data.

  WARN [Native-Transport-Requests:553] 2014-06-18 11:04:51,391
 BatchStatement.java (line 228) Batch of prepared statements for
 [identification1.entity, identification1.entity_lookup] is of size 6165,
 exceeding specified threshold of 5120 by 1045.
  WARN [Native-Transport-Requests:583] 2014-06-18 11:05:01,152
 BatchStatement.java (line 228) Batch of prepared statements for
 [identification1.entity, identification1.entity_lookup] is of size 21266,
 exceeding specified threshold of 5120 by 16146.
  WARN [Native-Transport-Requests:581] 2014-06-18 11:05:20,229
 BatchStatement.java (line 228) Batch of prepared statements for
 [identification1.entity, identification1.entity_lookup] is of size 22978,
 exceeding specified threshold of 5120 by 17858.
  INFO [MemoryMeter:1] 2014-06-18 11:05:32,682 Memtable.java (line 481)
 CFS(Keyspace='OpsCenter', ColumnFamily='rollups300') liveRatio is
 14.249755859375 (just-counted was 9.85302734375).  calculation took 3ms for
 1024 cells

 After some time, one node of the cluster goes down. Then it comes back
 after some seconds, and another node goes down. This keeps happening, and
 there is always a node down in the cluster; when one comes back, another
 one falls.

 The only exceptions I see in the log are connection reset by peer,
 which seem to be related to the gossip protocol, when a node goes down.

 Any hint on what I could do to investigate this problem further?

 Best regards,
 Marcelo Valle.
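For reference, the 5120 in these messages is the batch size warning threshold in bytes (5 kb). Depending on the exact 2.0.x release, it is configurable in cassandra.yaml via batch_size_warn_threshold_in_kb; raising it only silences the warning, though, and the usual fix is to issue smaller batches. An illustrative fragment (the value shown is arbitrary):

```yaml
# cassandra.yaml -- warn when a single batch exceeds this size (illustrative value)
batch_size_warn_threshold_in_kb: 25
```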



Re: Best practices for repair

2014-06-19 Thread Jack Krupansky

The DataStax doc should be current best practices:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html

If you or anybody else finds it inadequate, speak up.

-- Jack Krupansky

-Original Message- 
From: Paolo Crosato

Sent: Thursday, June 19, 2014 10:13 AM
To: user@cassandra.apache.org
Subject: Best practices for repair




Re: EBS SSD - Cassandra ?

2014-06-19 Thread Nate McCall
If someone really wanted to try this, I recommend adding an Elastic
Network Interface or two for gossip and client/API traffic. This lets EBS
and management traffic have the pre-configured network to themselves.


On Thu, Jun 19, 2014 at 6:54 AM, Benedict Elliott Smith 
belliottsm...@datastax.com wrote:



-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Migration 1.2.14 to 2.0.8 causes Tried to create duplicate hard link at startup

2014-06-19 Thread Tom van den Berge
It turns out this is caused by an earlier, failed attempt to upgrade.
Removing all pre-sstablemetamigration snapshot directories solved the issue.

Credits to Markus Eriksson.


On Wed, Jun 11, 2014 at 9:42 AM, Tom van den Berge t...@drillster.com
wrote:

 No, unfortunately I haven't.




 On Tue, Jun 10, 2014 at 5:35 PM, Chris Burroughs 
 chris.burrou...@gmail.com wrote:

 Were you able to solve or work around this problem?


 On 06/05/2014 11:47 AM, Tom van den Berge wrote:

 Hi,

 I'm trying to migrate a development cluster from 1.2.14 to 2.0.8. When
 starting up 2.0.8, I'm seeing the following error in the logs:


  INFO 17:40:25,405 Snapshotting drillster, Account to pre-sstablemetamigration
 ERROR 17:40:25,407 Exception encountered during startup
 java.lang.RuntimeException: Tried to create duplicate hard link to /Users/tom/cassandra-data/data/drillster/Account/snapshots/pre-sstablemetamigration/drillster-Account-ic-65-Filter.db
     at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:75)
     at org.apache.cassandra.db.compaction.LegacyLeveledManifest.snapshotWithoutCFS(LegacyLeveledManifest.java:129)
     at org.apache.cassandra.db.compaction.LegacyLeveledManifest.migrateManifests(LegacyLeveledManifest.java:91)
     at org.apache.cassandra.db.compaction.LeveledManifest.maybeMigrateManifests(LeveledManifest.java:617)
     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274)
     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)


 Does anyone have an idea how to solve this?


 Thanks,
 Tom









-- 

Drillster BV
Middenburcht 136
3452MT Vleuten
Netherlands

+31 30 755 5330

Open your free account at www.drillster.com


Re: EBS SSD - Cassandra ?

2014-06-19 Thread Russell Bradberry
Does an Elastic Network Interface really use a different physical network
interface, or does it just provide the ability to have multiple IP addresses?



On June 19, 2014 at 3:56:34 PM, Nate McCall (n...@thelastpickle.com) wrote:


Re: Best practices for repair

2014-06-19 Thread Paulo Ricardo Motta Gomes
Hello Paolo,

I just published an open source version of the dsetool list_subranges
command, which will enable you to perform subrange repair as described in
the post.

You can find the code and usage instructions here:
https://github.com/pauloricardomg/cassandra-list-subranges

Currently available for 1.2.16, but I guess that just changing the version
on the pom.xml and recompiling it will make it work on 2.0.x.

Cheers,

Paulo


On Thu, Jun 19, 2014 at 4:40 PM, Jack Krupansky j...@basetechnology.com
wrote:





-- 
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200


Issues with internode encryption - Keystore was tampered with, or password was incorrect

2014-06-19 Thread Carlos Scheidecker
Hello,

I am using Cassandra 2.1.0-rc1 and trying to set up internode encryption.

Here's how I have generated the certificates and keystores:

keytool -genkeypair -v -keyalg RSA -keysize 1024 -alias node1 -keystore
node1.keystore -storepass 'mypassword' -dname 'CN=Development' -keypass
'mypassword' -validity 3650

keytool -export -v -alias node1 -file node1.cer -keystore node1.keystore
-storepass 'mypassword'

keytool -import -v -trustcacerts -alias node1 -file node1.cer -keystore
global.truststore -storepass 'mypassword' -noprompt

Now, I have created a folder /etc/cassandra/certs

and copied the certs there: node1.keystore and global.truststore

I set the ownership of both to cassandra:cassandra and the permissions to
600.

Then in cassandra.yaml I did the following:

server_encryption_options:
    internode_encryption: all
    keystore: /etc/cassandra/certs/node1.keystore
    keystore_password: mypassword
    truststore: /etc/cassandra/certs/global.truststore
    truststore_password: mypassword


When starting the server I get the following error:

ERROR [main] 2014-06-19 15:00:03,701 CassandraDaemon.java:340 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Unable to create ssl socket
    at org.apache.cassandra.net.MessagingService.getServerSockets(MessagingService.java:431) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:411) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:694) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:628) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:511) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:336) [apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:455) [apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:544) [apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
Caused by: java.io.IOException: Error creating the initializing the SSL Context
    at org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:124) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.security.SSLFactory.getServerSocket(SSLFactory.java:53) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    at org.apache.cassandra.net.MessagingService.getServerSockets(MessagingService.java:427) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    ... 7 common frames omitted
Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
    at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:772) ~[na:1.8.0_05]
    at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:55) ~[na:1.8.0_05]
    at java.security.KeyStore.load(KeyStore.java:1433) ~[na:1.8.0_05]
    at org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:108) ~[apache-cassandra-2.1.0~rc1.jar:2.1.0~rc1]
    ... 9 common frames omitted
Caused by: java.security.UnrecoverableKeyException: Password verification failed
    at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:770) ~[na:1.8.0_05]
    ... 12 common frames omitted
INFO  [StorageServiceShutdownHook] 2014-06-19 15:00:03,705 Gossiper.java:1272 - Announcing shutdown


Why would the certificate fail, with the error "Keystore was tampered with,
or password was incorrect" displayed?

I have tested the keystore password by doing keytool -list -keystore
node1.keystore

And it shows the certificate, and that the password is correct:

keytool -list -keystore node1.keystore -storepass mypassword

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 1 entry

node1, Jun 19, 2014, PrivateKeyEntry,
Certificate fingerprint (SHA1):
85:28:6F:75:B5:E2:CE:5C:52:84:AC:A6:12:FC:45:FB:BA:8D:97:4D

I have no idea what went wrong, even though I have tried to find out.

It does not seem to be a Cassandra issue, but more likely an issue while
generating the keystore and truststore.

I am doing this for 4 nodes, which is why the truststore has the same file
name everywhere; only the keystores have names unique to each node.

Thanks.


Re: running out of diskspace during maintenance tasks

2014-06-19 Thread Jens Rantil
Hi Brian,


Which compaction strategy are you running? Have you tried leveled
compaction? AFAIK it should generally require less free disk space during
compaction.
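Jens's suggestion can be applied per table in CQL; a hedged sketch (the keyspace/table names and the sstable_size_in_mb value are illustrative, not from this thread):

```sql
-- Illustrative: switch one table to leveled compaction (CQL 3).
ALTER TABLE mykeyspace.mytable
  WITH compaction = { 'class' : 'LeveledCompactionStrategy',
                      'sstable_size_in_mb' : 160 };
```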




Cheers,

Jens
—
Sent from Mailbox

On Wed, Jun 18, 2014 at 6:02 PM, Brian Tarbox tar...@cabotresearch.com
wrote:

 I'm running on AWS m2.2xlarge instances using the ~800 gig
 ephemeral/attached disk for my data directory.  My data size per node is
 nearing 400 gig.
 Sometimes during maintenance operations (repairs mostly I think) I run out
 of disk space as my understanding is that some of these operations require
 double the space of one's data.
 Since I can't change the size of attached storage for my instance type my
 question is can I somehow get these maintenance operations to use other
 volumes?
 Failing that, what are my options?  Thanks.
 Brian Tarbox
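One hedged aside, not suggested in the thread: cassandra.yaml accepts a list of data directories, so an extra volume can absorb some compaction/repair overhead without changing the instance type. Paths below are illustrative:

```yaml
# Illustrative cassandra.yaml fragment: multiple data directories.
data_file_directories:
    - /var/lib/cassandra/data
    - /mnt/extra-volume/cassandra/data
```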

Re: can I kill very old data files in my data folder (I know that sounds crazy but....)

2014-06-19 Thread Jens Rantil
...and temporarily adding more nodes and rebalancing is not an option?
—
Sent from Mailbox

On Wed, Jun 18, 2014 at 9:39 PM, Brian Tarbox tar...@cabotresearch.com
wrote:

 I don't think I have the space to run a major compaction right now (I'm
 above 50% disk space used already) and compaction can take extra space I
 think?
 On Wed, Jun 18, 2014 at 3:24 PM, Robert Coli rc...@eventbrite.com wrote:
 On Wed, Jun 18, 2014 at 12:05 PM, Brian Tarbox tar...@cabotresearch.com
 wrote:

 Thank you!   We are not using TTL, we're manually deleting data more than
 5 days old for this CF.  We're running 1.2.13 and are using size tiered
 compaction (this CF is append-only, i.e. zero updates).

 Sounds like we can get away with doing a (stop, delete old-data-file,
 restart) process on a rolling basis if I understand you.


 Sure, though in your case (because you're using STS and can) I'd probably
 just run a major compaction.

 =Rob



Re: EBS SSD - Cassandra ?

2014-06-19 Thread Nate McCall
Sorry - I should have been clear that I was speaking in terms of route
optimization, not bandwidth. I have no idea as to the implementation
(probably instance specific) and I doubt it actually doubles bandwidth.

Specifically: having an ENI dedicated to API traffic did smooth out some
recent load tests we did for a client. It could be that the overall
throughput increases were more a function of cleaner traffic
segmentation/smoother routing. We weren't being terribly scientific - it
was more an artifact of testing network segmentation.

I'm just going to say that using an ENI will make things better (since
traffic segmentation is always good practice anyway :)  YMMV.



On Thu, Jun 19, 2014 at 3:39 PM, Russell Bradberry rbradbe...@gmail.com
wrote:

 does an elastic network interface really use a different physical network
 interface? or is it just to give the ability for multiple ip addresses?



 On June 19, 2014 at 3:56:34 PM, Nate McCall (n...@thelastpickle.com)
 wrote:

 If someone really wanted to try this it, I recommend adding an Elastic
 Network Interface or two for gossip and client/API traffic. This lets EBS
 and management traffic have the pre-configured network.


 On Thu, Jun 19, 2014 at 6:54 AM, Benedict Elliott Smith 
 belliottsm...@datastax.com wrote:

 I would say this is worth benchmarking before jumping to conclusions. The
 network being a bottleneck (or latency causing) for EBS is, to my
 knowledge, supposition, and instances can be started with direct
 connections to EBS if this is a concern. The blog post below shows that
 even without SSDs the EBS-optimised provisioned-IOPS instances show pretty
 consistent latency numbers, although those latencies are higher than you
 would typically expect from locally attached storage.


 http://blog.parse.com/2012/09/17/parse-databases-upgraded-to-amazon-provisioned-iops/

 Note, I'm not endorsing the use of EBS. Cassandra is designed to scale up
 with number of nodes, not with depth of nodes (as Ben mentions, saturating
 a single node's data capacity is pretty easy these days. CPUs rapidly
 become the bottleneck as you try to go deep). However the argument that EBS
 cannot provide consistent performance seems overly pessimistic, and should
 probably be empirically determined for your use case.


 On Thu, Jun 19, 2014 at 9:50 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Ok, looks fair enough.

 Thanks guys. It would be great to be able to add disks when the amount of
 data rises and add nodes when throughput increases... :)


 2014-06-19 5:27 GMT+02:00 Ben Bromhead b...@instaclustr.com:


 http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html

 From the link:

 EBS volumes are not recommended for Cassandra data volumes for the
 following reasons:

 • EBS volumes contend directly for network throughput with standard
 packets. This means that EBS throughput is likely to fail if you saturate a
 network link.
 • EBS volumes have unreliable performance. I/O performance can be
 exceptionally slow, causing the system to back load reads and writes until
 the entire cluster becomes unresponsive.
 • Adding capacity by increasing the number of EBS volumes per host does
 not scale. You can easily surpass the ability of the system to keep
 effective buffer caches and concurrently serve requests for all of the data
 it is responsible for managing.

 Still applies, especially the network contention and latency issues.

   Ben Bromhead
  Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | +61 415 936 359

  On 18 Jun 2014, at 7:18 pm, Daniel Chia danc...@coursera.org wrote:

  While they guarantee IOPS, they don't really make any guarantees
 about latency. Since EBS goes over the network, there's so many things in
 the path of getting at your data, I would be concerned with random latency
 spikes, unless proven otherwise.

 Thanks,
 Daniel


 On Wed, Jun 18, 2014 at 1:58 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 In this document it is said :


 - Provisioned IOPS (SSD) - Volumes of this type are ideal for the most
   demanding I/O intensive, transactional workloads and large relational
   or NoSQL databases. This volume type provides the most consistent
   performance and allows you to provision the exact level of performance
   you need with the most predictable and consistent performance. With
   this type of volume you provision exactly what you need, and pay for
   what you provision. Once again, you can achieve up to 48,000 IOPS by
   connecting multiple volumes together using RAID.



 2014-06-18 10:57 GMT+02:00 Alain RODRIGUEZ arodr...@gmail.com:

  Hi,

 I just saw this :
 http://aws.amazon.com/fr/blogs/aws/new-ssd-backed-elastic-block-storage/

 Since the problem with EBS was the network, there is no chance that this
 hardware architecture might be useful alongside Cassandra, right?

 Alain









 --
 -
 Nate McCall
 Austin, TX
 @zznate

 Co-Founder  

Re: Issues with internode encryption - Keystore was tampered with, or password was incorrect

2014-06-19 Thread Carlos Scheidecker
Never mind, fellas.

I found the silly error. Sharing it with you just in case: a typo in the
script I used to generate the keystores and certificates.

I had literal quote characters in the generated commands: the script passed
-storepass 'mypassword' with the quotes intact, while the correct form is
-storepass mypassword (the quotes became part of the stored password).

I knew it was a certificate issue; by debugging it I was able to find the
cause.

The longer you do things, the more prone you are to errors.
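The pitfall is easy to reproduce outside keytool. A hedged sketch in Python (shlex mimics the shell's quote handling; the command strings are illustrative):

```python
import shlex

# When the shell itself parses the command line, the single quotes are
# consumed by the shell, so keytool sees the bare password:
cmd = "keytool -genkeypair -storepass 'mypassword'"
assert shlex.split(cmd)[-1] == "mypassword"

# But if a script builds the argument list itself and copies the quoted
# token verbatim, the quotes travel into keytool and become part of the
# stored password -- which later fails to verify against "mypassword":
argv = ["keytool", "-genkeypair", "-storepass", "'mypassword'"]
assert argv[-1] != "mypassword"
```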

cheers








Re: Batch of prepared statements exceeding specified threshold

2014-06-19 Thread Marcelo Elias Del Valle
I now know it's caused by the heap filling up on some nodes. When it fills
up, the node goes down, GC runs more, then the node comes back up.
Looking for GCInspector in the log, I see GC takes longer each time it
runs, as shown below.
I have set the key cache to 100 MB, and I used to be able to query many
more rows in Cassandra before...

I am trying to find a query of mine that returns a lot of columns, but
usually a query returns just 1 row, and each row tends to have 100 columns
at most...
I will keep looking, but could this be a bug? I am using the newest
version of Cassandra.

INFO [ScheduledTasks:1] 2014-06-19 19:36:36,240 GCInspector.java (line 116)
GC for ConcurrentMarkSweep: 14837 ms for 1 collections, 8245789072 used;
max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:36:57,621 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 20637 ms for 1 collections, 8403381728
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:37:13,291 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 15134 ms for 1 collections, 8398383880
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:37:34,775 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 20897 ms for 1 collections, 8404085176
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:37:50,364 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 14926 ms for 1 collections, 8293046264
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:38:11,762 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 20705 ms for 1 collections, 8426815144
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:38:27,413 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 15172 ms for 1 collections, 8426043120
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:38:48,956 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 20993 ms for 1 collections, 8425551136
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:39:04,827 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 15374 ms for 1 collections, 8426721952
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:39:26,319 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 20958 ms for 1 collections, 8431842432
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:39:41,996 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 14984 ms for 1 collections, 8422533432
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:40:03,351 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 21220 ms for 1 collections, 8422545360
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:40:18,866 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 15135 ms for 1 collections, 8426651232
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:40:40,305 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 20953 ms for 1 collections, 8462319400
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:41:08,079 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 27551 ms for 2 collections, 8463374528
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:41:29,510 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 21189 ms for 1 collections, 8466512144
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:41:44,936 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 15220 ms for 1 collections, 8470873096
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:42:06,300 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 21178 ms for 1 collections, 8471013184
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:42:21,712 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 15227 ms for 1 collections, 8476991784
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:42:43,068 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 21199 ms for 1 collections, 8478612392
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:42:58,120 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 14917 ms for 1 collections, 8481096608
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:43:31,975 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 33698 ms for 2 collections, 8484881064
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:43:47,339 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 15239 ms for 1 collections, 8485352000
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:44:24,136 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 36576 ms for 2 collections, 8489333048
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:45:01,187 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 36861 ms for 2 collections, 8491277224
used; max is 8506048512
 INFO [ScheduledTasks:1] 2014-06-19 19:45:22,650 GCInspector.java (line
116) GC for ConcurrentMarkSweep: 21369 ms for 1 collections, 8493227920
used; max is 8506048512
 INFO [ScheduledTasks:1] 
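To quantify the pattern in a log like the one above, the GCInspector lines can be parsed and summarised. A small sketch (the regex targets the 1.2/2.0 log format shown here; this is an aside, not from the thread):

```python
import re

# Matches GCInspector output such as:
#   GC for ConcurrentMarkSweep: 14837 ms for 1 collections, 8245789072 used; max is 8506048512
GC_RE = re.compile(
    r"GC for (?P<algo>\w+): (?P<ms>\d+) ms for (?P<n>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)")

def parse_gc_line(line):
    """Return pause time and post-GC heap utilisation for one log line."""
    m = GC_RE.search(line)
    if m is None:
        return None
    used, mx = int(m.group("used")), int(m.group("max"))
    return {
        "algo": m.group("algo"),
        "pause_ms": int(m.group("ms")),
        "collections": int(m.group("n")),
        # How full the heap still is *after* the collection -- the telling
        # number in the log above (consistently > 96%).
        "heap_pct": 100.0 * used / mx,
    }
```

Running this over the log above shows CMS pauses of 15-37 seconds with the heap still ~97% full afterwards, i.e. almost nothing is reclaimable.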

Re: Batch of prepared statements exceeding specified threshold

2014-06-19 Thread Marcelo Elias Del Valle
Pavel,

Out of curiosity, did it start to happen after some update? Which version
of Cassandra are you using?

[]s


2014-06-19 16:10 GMT-03:00 Pavel Kogan pavel.ko...@cortica.com:

 What a coincidence! Today happened in my cluster of 7 nodes as well.

 Regards,
   Pavel


 On Wed, Jun 18, 2014 at 11:13 AM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 I have a 10 node cluster with cassandra 2.0.8.

 I am getting these exceptions in the log when I run my code. What my code
 does is just read data from a CF, and in some cases it writes new data.

  WARN [Native-Transport-Requests:553] 2014-06-18 11:04:51,391
 BatchStatement.java (line 228) Batch of prepared statements for
 [identification1.entity, identification1.entity_lookup] is of size 6165,
 exceeding specified threshold of 5120 by 1045.
  WARN [Native-Transport-Requests:583] 2014-06-18 11:05:01,152
 BatchStatement.java (line 228) Batch of prepared statements for
 [identification1.entity, identification1.entity_lookup] is of size 21266,
 exceeding specified threshold of 5120 by 16146.
  WARN [Native-Transport-Requests:581] 2014-06-18 11:05:20,229
 BatchStatement.java (line 228) Batch of prepared statements for
 [identification1.entity, identification1.entity_lookup] is of size 22978,
 exceeding specified threshold of 5120 by 17858.
  INFO [MemoryMeter:1] 2014-06-18 11:05:32,682 Memtable.java (line 481)
 CFS(Keyspace='OpsCenter', ColumnFamily='rollups300') liveRatio is
 14.249755859375 (just-counted was 9.85302734375).  calculation took 3ms for
 1024 cells

 After some time, one node of the cluster goes down. Then it comes back
 after some seconds and another node goes down. It keeps happening: there
 is always a node down in the cluster, and when it comes back another one
 falls.

 The only exceptions I see in the log are "connection reset by peer",
 which seem to be related to the gossip protocol when a node goes down.

 Any hints on what I could do to investigate this problem further?

 Best regards,
 Marcelo Valle.
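The 5120-byte figure in those warnings comes from cassandra.yaml's batch_size_warn_threshold_in_kb setting (default 5, i.e. 5 KB; worth verifying against your yaml). One common client-side response, sketched here as an assumption rather than anything suggested in the thread, is to split batches greedily so each stays under the threshold; size_of is a caller-supplied estimator, since exact serialized sizes depend on the driver:

```python
def chunk_statements(statements, size_of, max_bytes=5 * 1024):
    """Greedily split `statements` so the estimated size of each chunk stays
    at or under max_bytes. A single oversized statement still gets its own
    chunk (it cannot be split here)."""
    chunks, current, current_size = [], [], 0
    for stmt in statements:
        size = size_of(stmt)
        # Flush the current chunk before it would overflow.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(stmt)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be sent as its own batch (or, for unrelated rows, as individual async statements, which is usually faster anyway).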





Best way to do a multi_get using CQL

2014-06-19 Thread Marcelo Elias Del Valle
I was taking a look at the Cassandra anti-patterns list:

http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html

Among them is:

SELECT ... IN or index lookups
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html?scroll=archPlanAntiPattern__AntiPatMultiGet

SELECT ... IN and index lookups (formerly secondary indexes) should be
avoided except for specific scenarios. See *When not to use IN* in SELECT
(http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html)
and *When not to use an index* in Indexing
(http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_primary_index_c.html)
in *CQL for Cassandra 2.0*.

And looking at the SELECT doc, I saw:

When *not* to use IN
(http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html?scroll=reference_ds_d35_v2q_xj__selectInNot)

The recommendations about when not to use an index
(http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html)
apply to using IN in the WHERE clause. Under most conditions, using IN in
the WHERE clause is not recommended. Using IN can degrade performance
because usually many nodes must be queried. For example, in a single, local
data center cluster having 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes being
queried are most likely even higher, up to 20 nodes depending on where the
keys fall in the token range.

In my system, I have a column family called entity_lookup:

CREATE KEYSPACE IF NOT EXISTS Identification1
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
  'DC1' : 3 };
USE Identification1;

CREATE TABLE IF NOT EXISTS entity_lookup (
  name varchar,
  value varchar,
  entity_id uuid,
  PRIMARY KEY ((name, value), entity_id));

And I use the following select to query it:

SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)

Is this an anti-pattern?

If not SELECT ... IN, what other way would you recommend for lookups like
this? I have several values I would like to search for in Cassandra, and
they might not be in the same partition, as above.

Is Cassandra the wrong tool for lookups like that?

Best regards,
Marcelo Valle.


Re: Best way to do a multi_get using CQL

2014-06-19 Thread Jonathan Haddad
Your other option is to fire off async queries.  It's pretty
straightforward w/ the java or python drivers.


-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade


Re: Best way to do a multi_get using CQL

2014-06-19 Thread Marcelo Elias Del Valle
But wouldn't using async queries be even worse than using SELECT IN?
The justification in the docs is that I could end up querying many nodes,
but with async queries I would still do that.

Today, I use both async queries AND SELECT IN:

SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"

for name, values in identifiers.items():
    query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s'] * len(values)))
    args = [name] + values
    query_msg = query % tuple(args)
    futures.append((query_msg, self.session.execute_async(query, args)))

for query_msg, future in futures:
    try:
        rows = future.result(timeout=10)
        for row in rows:
            entity_ids.add(row.entity_id)
    except:
        logging.error("Query '%s' returned ERROR" % (query_msg,))
        raise

Using async with plain equality selects would mean that instead of 1 async
query (for example, in (0, 1, 2)) I would issue several, one for each value
in the values array above.
In my head, this would mean more connections to Cassandra and the same
amount of work, right? What would be the advantage?

[]s




2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 Your other option is to fire off async queries.  It's pretty
 straightforward w/ the java or python drivers.




 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade



Re: Best way to do a multi_get using CQL

2014-06-19 Thread Jonathan Haddad
If you use async queries and your driver is token aware, each query will
go to the proper node, rather than relying on the coordinator to route it.

Realistically, you're going to have a connection open to every server
anyway. It's the difference between querying for the data directly and
using a coordinator as a proxy. It's faster to just ask the node that has
the data.
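As a concrete sketch of the per-key async pattern Jonathan describes (hedged: `session` stands in for a DataStax-driver Session, and the statement mirrors Marcelo's table; any object with a compatible execute_async works for illustration):

```python
# One equality query per (name, value) pair instead of one big IN query,
# so a token-aware driver can route each request directly to a replica.
SELECT_ONE = "SELECT entity_id FROM entity_lookup WHERE name=%s AND value=%s"

def lookup_entities(session, name, values):
    # Fire all queries first; the driver multiplexes them over its existing
    # per-node connections, so this does not open one connection per query.
    futures = [(v, session.execute_async(SELECT_ONE, (name, v)))
               for v in values]
    entity_ids = set()
    for value, future in futures:
        # Blocks until that single-partition query completes.
        for row in future.result():
            entity_ids.add(row.entity_id)
    return entity_ids
```

With the real driver, what makes the routing direct is wrapping the load-balancing policy, e.g. TokenAwarePolicy(DCAwareRoundRobinPolicy()); that wiring is an assumption of this sketch, not something shown in the thread.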

On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
marc...@s1mbi0se.com.br wrote:
 But using async queries wouldn't be even worse than using SELECT IN?
 The justification in the docs is I could query many nodes, but I would still
 do it.

 Today, I use both async queries AND SELECT IN:

 SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP +  WHERE
 name=%s and value in(%s)

 for name, values in identifiers.items():
query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
args = [name] + values
query_msg = query % tuple(args)
futures.append((query_msg, self.session.execute_async(query, args)))

 for query_msg, future in futures:
try:
   rows = future.result(timeout=10)
   for row in rows:
 entity_ids.add(row.entity_id)
except:
   logging.error(Query '%s' returned ERROR  % (query_msg))
   raise

 Using async just with select = would mean instead of 1 async query (example:
 in (0, 1, 2)), I would do several, one for each value of values array
 above.
 In my head, this would mean more connections to Cassandra and the same
 amount of work, right? What would be the advantage?

 []s




 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 Your other option is to fire off async queries.  It's pretty
 straightforward w/ the java or python drivers.

 On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  I was taking a look at Cassandra anti-patterns list:
 
 
  http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
 
  Among then is
 
  SELECT ... IN or index lookups¶
 
  SELECT ... IN and index lookups (formerly secondary indexes) should be
  avoided except for specific scenarios. See When not to use IN in SELECT
  and
  When not to use an index in Indexing in
 
  CQL for Cassandra 2.0
 
  And Looking at the SELECT doc, I saw:
 
  When not to use IN¶
 
  The recommendations about when not to use an index apply to using IN in
  the
  WHERE clause. Under most conditions, using IN in the WHERE clause is not
  recommended. Using IN can degrade performance because usually many nodes
  must be queried. For example, in a single, local data center cluster
  having
  30 nodes, a replication factor of 3, and a consistency level of
  LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
  uses the IN condition, the number of nodes being queried are most likely
  even higher, up to 20 nodes depending on where the keys fall in the
  token
  range.
 
  In my system, I have a column family called entity_lookup:
 
  CREATE KEYSPACE IF NOT EXISTS Identification1
WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
'DC1' : 3 };
  USE Identification1;
 
  CREATE TABLE IF NOT EXISTS entity_lookup (
name varchar,
value varchar,
entity_id uuid,
PRIMARY KEY ((name, value), entity_id));
 
  And I use the following select to query it:
 
  SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
 
  Is this an anti-pattern?
 
  If not using SELECT IN, which other way would you recommend for lookups
  like that? I have several values I would like to search for in Cassandra
  and they might not be in the same partition, as above.
 
  Is Cassandra the wrong tool for lookups like that?
 
  Best regards,
  Marcelo Valle.
 



 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade





-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade


Re: Best way to do a multi_get using CQL

2014-06-19 Thread Marcelo Elias Del Valle
This is interesting, I didn't know that!
It might make sense then to use select with = + async + token aware; I will
try to change my code.

But would it be a recommended solution for these cases? Any other options?

I still wonder if this is the right use case for Cassandra, to look up
random keys in a huge cluster. After all, the number of connections to
Cassandra will still be huge, right? Wouldn't that be a problem?
Or does the driver reuse the connection when you use async?

[]s


2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 If you use async and your driver is token aware, it will go to the
 proper node, rather than requiring the coordinator to do so.

 Realistically you're going to have a connection open to every server
 anyways.  It's the difference between you querying for the data
 directly and using a coordinator as a proxy.  It's faster to just ask
 the node with the data.

 On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  But using async queries wouldn't be even worse than using SELECT IN?
  The justification in the docs is I could query many nodes, but I would
 still
  do it.
 
  Today, I use both async queries AND SELECT IN:
 
  SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE
  name=%s and value in(%s)"
 
  for name, values in identifiers.items():
 query = self.SELECT_ENTITY_LOOKUP % ('%s',
 ','.join(['%s']*len(values)))
 args = [name] + values
 query_msg = query % tuple(args)
 futures.append((query_msg, self.session.execute_async(query, args)))
 
  for query_msg, future in futures:
 try:
rows = future.result(timeout=10)
for row in rows:
  entity_ids.add(row.entity_id)
 except:
    logging.error("Query '%s' returned ERROR" % (query_msg))
raise
 
  Using async with plain = selects would mean that instead of 1 async query
  (example: in (0, 1, 2)), I would do several, one for each value of the
  values array above.
  In my head, this would mean more connections to Cassandra and the same
  amount of work, right? What would be the advantage?
 
  []s
 
 
 
 
  2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  Your other option is to fire off async queries.  It's pretty
  straightforward w/ the java or python drivers.
 
  On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   I was taking a look at Cassandra anti-patterns list:
  
  
  
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
  
   Among them is:
  
   SELECT ... IN or index lookups
  
   SELECT ... IN and index lookups (formerly secondary indexes) should be
   avoided except for specific scenarios. See When not to use IN in
 SELECT
   and
   When not to use an index in Indexing in
  
   CQL for Cassandra 2.0
  
   And Looking at the SELECT doc, I saw:
  
   When not to use IN
  
   The recommendations about when not to use an index apply to using IN
 in
   the
   WHERE clause. Under most conditions, using IN in the WHERE clause is
 not
   recommended. Using IN can degrade performance because usually many
 nodes
   must be queried. For example, in a single, local data center cluster
   having
   30 nodes, a replication factor of 3, and a consistency level of
   LOCAL_QUORUM, a single key query goes out to two nodes, but if the
 query
   uses the IN condition, the number of nodes being queried are most
 likely
   even higher, up to 20 nodes depending on where the keys fall in the
   token
   range.
  
   In my system, I have a column family called entity_lookup:
  
   CREATE KEYSPACE IF NOT EXISTS Identification1
 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
 'DC1' : 3 };
   USE Identification1;
  
   CREATE TABLE IF NOT EXISTS entity_lookup (
 name varchar,
 value varchar,
 entity_id uuid,
 PRIMARY KEY ((name, value), entity_id));
  
   And I use the following select to query it:
  
   SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
  
   Is this an anti-pattern?
  
   If not using SELECT IN, which other way would you recommend for lookups
   like that? I have several values I would like to search for in Cassandra
   and they might not be in the same partition, as above.
  
   Is Cassandra the wrong tool for lookups like that?
  
   Best regards,
   Marcelo Valle.
  
 
 
 
  --
  Jon Haddad
  http://www.rustyrazorblade.com
  skype: rustyrazorblade
 
 



 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade



Re: Best way to do a multi_get using CQL

2014-06-19 Thread Jonathan Haddad
The only case in which it might be better to use an IN clause is if
the entire query can be satisfied from that machine.  Otherwise, go
async.

The native driver reuses connections and intelligently manages the
pool for you.  It can also multiplex queries over a single connection.

I am assuming you're using one of the datastax drivers for CQL, btw.

Jon
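
To make the suggestion above concrete, here is a minimal sketch of the async
fan-out, assuming the DataStax Python driver (cassandra-driver); build_lookups
and multi_get are illustrative names, and the result(timeout=...) call mirrors
the driver usage already shown earlier in this thread:

```python
# Sketch: replace one "value IN (...)" query with one single-key query
# per value, so a token-aware driver can route each request directly to
# a replica instead of funneling everything through one coordinator.

SELECT_ONE = "SELECT entity_id FROM entity_lookup WHERE name=%s AND value=%s"

def build_lookups(identifiers):
    """Expand {name: [values]} into one (name, value) pair per query."""
    return [(name, value)
            for name, values in identifiers.items()
            for value in values]

def multi_get(session, identifiers, timeout=10):
    # Fire everything first, then collect: total latency is roughly one
    # round trip instead of len(values) sequential round trips.
    futures = [(args, session.execute_async(SELECT_ONE, args))
               for args in build_lookups(identifiers)]
    entity_ids = set()
    for args, future in futures:
        for row in future.result(timeout=timeout):
            entity_ids.add(row.entity_id)
    return entity_ids
```

With a TokenAwarePolicy configured on the Cluster, each of these
single-partition queries should go straight to a node that owns the data.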

On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
marc...@s1mbi0se.com.br wrote:
 This is interesting, I didn't know that!
 It might make sense then to use select with = + async + token aware; I will
 try to change my code.

 But would it be a recommended solution for these cases? Any other options?

 I still wonder if this is the right use case for Cassandra, to look up
 random keys in a huge cluster. After all, the number of connections to
 Cassandra will still be huge, right? Wouldn't that be a problem?
 Or does the driver reuse the connection when you use async?

 []s


 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 If you use async and your driver is token aware, it will go to the
 proper node, rather than requiring the coordinator to do so.

 Realistically you're going to have a connection open to every server
 anyways.  It's the difference between you querying for the data
 directly and using a coordinator as a proxy.  It's faster to just ask
 the node with the data.

 On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  But using async queries wouldn't be even worse than using SELECT IN?
  The justification in the docs is I could query many nodes, but I would
  still
  do it.
 
  Today, I use both async queries AND SELECT IN:
 
   SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE
   name=%s and value in(%s)"
 
  for name, values in identifiers.items():
 query = self.SELECT_ENTITY_LOOKUP % ('%s',
  ','.join(['%s']*len(values)))
 args = [name] + values
 query_msg = query % tuple(args)
 futures.append((query_msg, self.session.execute_async(query, args)))
 
  for query_msg, future in futures:
 try:
rows = future.result(timeout=10)
for row in rows:
  entity_ids.add(row.entity_id)
 except:
   logging.error("Query '%s' returned ERROR" % (query_msg))
raise
 
   Using async with plain = selects would mean that instead of 1 async query
   (example: in (0, 1, 2)), I would do several, one for each value of the
   values array above.
  In my head, this would mean more connections to Cassandra and the same
  amount of work, right? What would be the advantage?
 
  []s
 
 
 
 
  2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  Your other option is to fire off async queries.  It's pretty
  straightforward w/ the java or python drivers.
 
  On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   I was taking a look at Cassandra anti-patterns list:
  
  
  
   http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
  
    Among them is:
   
    SELECT ... IN or index lookups
  
   SELECT ... IN and index lookups (formerly secondary indexes) should
   be
   avoided except for specific scenarios. See When not to use IN in
   SELECT
   and
   When not to use an index in Indexing in
  
   CQL for Cassandra 2.0
  
   And Looking at the SELECT doc, I saw:
  
    When not to use IN
  
   The recommendations about when not to use an index apply to using IN
   in
   the
   WHERE clause. Under most conditions, using IN in the WHERE clause is
   not
   recommended. Using IN can degrade performance because usually many
   nodes
   must be queried. For example, in a single, local data center cluster
   having
   30 nodes, a replication factor of 3, and a consistency level of
   LOCAL_QUORUM, a single key query goes out to two nodes, but if the
   query
   uses the IN condition, the number of nodes being queried are most
   likely
   even higher, up to 20 nodes depending on where the keys fall in the
   token
   range.
  
   In my system, I have a column family called entity_lookup:
  
   CREATE KEYSPACE IF NOT EXISTS Identification1
 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
 'DC1' : 3 };
   USE Identification1;
  
   CREATE TABLE IF NOT EXISTS entity_lookup (
 name varchar,
 value varchar,
 entity_id uuid,
 PRIMARY KEY ((name, value), entity_id));
  
   And I use the following select to query it:
  
   SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
  
   Is this an anti-pattern?
  
    If not using SELECT IN, which other way would you recommend for lookups
    like that? I have several values I would like to search for in Cassandra
    and they might not be in the same partition, as above.
  
   Is Cassandra the wrong tool for lookups like that?
  
   Best regards,
   Marcelo Valle.
 
 
 
  --
  Jon Haddad
  http://www.rustyrazorblade.com
  skype: 

Re: EBS SSD - Cassandra ?

2014-06-19 Thread Ben Bromhead
Irrespective of performance and latency numbers there are fundamental flaws 
with using EBS/NAS and Cassandra, particularly around bandwidth contention and 
what happens when the shared storage medium breaks. Also obligatory reference 
to http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html.

Regarding ENI

AWS is pretty explicit about its impact on bandwidth:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html
Attaching another network interface to an instance is not a method to increase 
or double the network bandwidth to or from the dual-homed instance.

So Nate, you are right in that it is likely the logical separation that helps 
for some reason. 
 

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359

On 20 Jun 2014, at 8:17 am, Nate McCall n...@thelastpickle.com wrote:

 Sorry - should have been clear I was speaking in terms of route optimizing, 
 not bandwidth. No idea as to the implementation (probably instance specific) 
 and I doubt it actually doubles bandwidth. 
 
 Specifically: having an ENI dedicated to API traffic did smooth out some 
  recent load tests we did for a client. It could be that the overall throughput 
  increases were more a function of cleaner traffic segmentation/smoother 
  routing. We weren't being terribly scientific - it was more an artifact of 
  testing network segmentation. 
 
 I'm just going to say that using an ENI will make things better (since 
 traffic segmentation is always good practice anyway :)  YMMV. 
 
 
 
 On Thu, Jun 19, 2014 at 3:39 PM, Russell Bradberry rbradbe...@gmail.com 
 wrote:
 does an elastic network interface really use a different physical network 
 interface? or is it just to give the ability for multiple ip addresses?
 
 
 
 On June 19, 2014 at 3:56:34 PM, Nate McCall (n...@thelastpickle.com) wrote:
 
 If someone really wanted to try this it, I recommend adding an Elastic 
 Network Interface or two for gossip and client/API traffic. This lets EBS 
 and management traffic have the pre-configured network. 
 
 
 On Thu, Jun 19, 2014 at 6:54 AM, Benedict Elliott Smith 
 belliottsm...@datastax.com wrote:
 I would say this is worth benchmarking before jumping to conclusions. The 
 network being a bottleneck (or latency causing) for EBS is, to my knowledge, 
 supposition, and instances can be started with direct connections to EBS if 
 this is a concern. The blog post below shows that even without SSDs the 
 EBS-optimised provisioned-IOPS instances show pretty consistent latency 
 numbers, although those latencies are higher than you would typically expect 
 from locally attached storage.
 
 http://blog.parse.com/2012/09/17/parse-databases-upgraded-to-amazon-provisioned-iops/
 
 Note, I'm not endorsing the use of EBS. Cassandra is designed to scale up 
 with number of nodes, not with depth of nodes (as Ben mentions, saturating a 
 single node's data capacity is pretty easy these days. CPUs rapidly become 
 the bottleneck as you try to go deep). However the argument that EBS cannot 
 provide consistent performance seems overly pessimistic, and should probably 
 be empirically determined for your use case.
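
One low-effort way to do that empirical check is a synthetic random-read run
with fio against the data volume. This is a sketch only - the target
directory, file size, and queue depths are assumptions; point it at your
actual Cassandra data mount and adjust for your volume and instance type:

```shell
# Random-read probe: reports IOPS and latency percentiles for the volume.
# --direct=1 bypasses the page cache so you measure the device, and
# --time_based keeps the job running for the full runtime even if the
# test file is read through more than once.
fio --name=ebs-randread \
    --directory=/var/lib/cassandra/data \
    --rw=randread --bs=4k --size=1g \
    --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting
```

Compare the latency percentiles (not just the averages) between an
EBS-optimised provisioned-IOPS volume and instance storage before deciding.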
 
 
 On Thu, Jun 19, 2014 at 9:50 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 Ok, looks fair enough.
 
 Thanks guys. It would be great to be able to add disks when the amount of data 
 grows and add nodes when throughput increases... :)
 
 
 2014-06-19 5:27 GMT+02:00 Ben Bromhead b...@instaclustr.com:
 
 http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html
 
 From the link:
 
 EBS volumes are not recommended for Cassandra data volumes for the following 
 reasons:
 
 • EBS volumes contend directly for network throughput with standard packets. 
 This means that EBS throughput is likely to fail if you saturate a network 
 link.
 • EBS volumes have unreliable performance. I/O performance can be 
 exceptionally slow, causing the system to back load reads and writes until 
 the entire cluster becomes unresponsive.
 • Adding capacity by increasing the number of EBS volumes per host does not 
 scale. You can easily surpass the ability of the system to keep effective 
 buffer caches and concurrently serve requests for all of the data it is 
 responsible for managing.
 
 Still applies, especially the network contention and latency issues. 
 
 Ben Bromhead
 Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359
 
 On 18 Jun 2014, at 7:18 pm, Daniel Chia danc...@coursera.org wrote:
 
 While they guarantee IOPS, they don't really make any guarantees about 
 latency. Since EBS goes over the network, there's so many things in the 
 path of getting at your data, I would be concerned with random latency 
 spikes, unless proven otherwise.
 
 Thanks,
 Daniel
 
 
 On Wed, Jun 18, 2014 at 1:58 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 In this document it is said :
 
 Provisioned IOPS (SSD) - Volumes of this type are ideal for the most 
 demanding I/O