Frequent secondary index sstable corruption

2014-06-10 Thread Jeremy Jongsma
I'm in the process of migrating data over to cassandra for several of our
apps, and a few of the schemas use secondary indexes. Four times in the
last couple months I've run into a corrupted sstable belonging to a
secondary index, but have never seen this on any other sstables. When it
happens, any query against the secondary index just hangs until the node is
fixed. It's making me a bit nervous about using secondary indexes in
production.

This has usually happened after a bulk data import, so I am wondering if
the firehose method of dumping initial data into cassandra (write
consistency = any) is causing some sort of write concurrency issue when it
comes to secondary indexes. Has anyone else experienced this?

The cluster is running 1.2.16 on 4x EC2 m1.large instances.


Re: Migration 1.2.14 to 2.0.8 causes Tried to create duplicate hard link at startup

2014-06-10 Thread Chris Burroughs

Were you able to solve or work around this problem?

On 06/05/2014 11:47 AM, Tom van den Berge wrote:

Hi,

I'm trying to migrate a development cluster from 1.2.14 to 2.0.8. When
starting up 2.0.8, I'm seeing the following error in the logs:


 INFO 17:40:25,405 Snapshotting drillster, Account to pre-sstablemetamigration
ERROR 17:40:25,407 Exception encountered during startup
java.lang.RuntimeException: Tried to create duplicate hard link to /Users/tom/cassandra-data/data/drillster/Account/snapshots/pre-sstablemetamigration/drillster-Account-ic-65-Filter.db
    at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:75)
    at org.apache.cassandra.db.compaction.LegacyLeveledManifest.snapshotWithoutCFS(LegacyLeveledManifest.java:129)
    at org.apache.cassandra.db.compaction.LegacyLeveledManifest.migrateManifests(LegacyLeveledManifest.java:91)
    at org.apache.cassandra.db.compaction.LeveledManifest.maybeMigrateManifests(LeveledManifest.java:617)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)


Does anyone have an idea how to solve this?


Thanks,
Tom





Re: Frequent secondary index sstable corruption

2014-06-10 Thread Robert Coli
On Tue, Jun 10, 2014 at 7:31 AM, Jeremy Jongsma jer...@barchart.com wrote:

 I'm in the process of migrating data over to cassandra for several of our
 apps, and a few of the schemas use secondary indexes. Four times in the
 last couple months I've run into a corrupted sstable belonging to a
 secondary index, but have never seen this on any other sstables. When it
 happens, any query against the secondary index just hangs until the node is
 fixed. It's making me a bit nervous about using secondary indexes in
 production.


http://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201405.mbox/%3CCAEDUwd1i2BwJ-PAFE1qhjQFZ=qz2va_vxwo_jdycms8evkb...@mail.gmail.com%3E

I don't know if this particular issue is known and/or fixed upstream, but
FWIW/FYI!

=Rob


Re: Frequent secondary index sstable corruption

2014-06-10 Thread Tyler Hobbs
If you've been dropping and recreating tables with the same name, you might
be seeing this: https://issues.apache.org/jira/browse/CASSANDRA-6525


On Tue, Jun 10, 2014 at 12:19 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Jun 10, 2014 at 7:31 AM, Jeremy Jongsma jer...@barchart.com
 wrote:

 I'm in the process of migrating data over to cassandra for several of our
 apps, and a few of the schemas use secondary indexes. Four times in the
 last couple months I've run into a corrupted sstable belonging to a
 secondary index, but have never seen this on any other sstables. When it
 happens, any query against the secondary index just hangs until the node is
 fixed. It's making me a bit nervous about using secondary indexes in
 production.



 http://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201405.mbox/%3CCAEDUwd1i2BwJ-PAFE1qhjQFZ=qz2va_vxwo_jdycms8evkb...@mail.gmail.com%3E

 I don't know if this particular issue is known and/or fixed upstream, but
 FWIW/FYI!

 =Rob




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Cannot query secondary index

2014-06-10 Thread Redmumba
Honestly, this has been by far my single biggest obstacle with Cassandra
for time-based data--cleaning up the old data when the deletion criteria
(i.e., date) isn't the primary key.  I've asked about a few different
approaches, but I haven't really seen any feasible options that can be
implemented easily.  I've seen the following:

   1. Use date-based tables, then drop old tables, ala
   audit_table_20140610, audit_table_20140609, etc..
   But then I run into the issue of having to query every table--I would
   have to execute queries against every day to get the data, and then merge
   the data myself.  Unless there's something in the binary driver I'm
   missing, it doesn't sound like this would be practical.
   2. Use a TTL
   But then I have to basically decide on a value that works for everything
   and, if it ever turns out I overestimated, I'm basically SOL, because my
   cluster will be out of space.
   3. Maintain a separate index of days to keys, and use this index as the
   reference for which keys to delete.
   But then this requires maintaining another index and a relatively manual
   delete.

I can't help but feel that I am just way over-engineering this, or that I'm
missing something basic in my data model.  Except for the last approach, I
can't help but feel that I'm overlooking something obvious.

Andrew


Of course, Jonathan, I'll do my best!

It's an auditing table that, right now, uses a primary key consisting of a
composite partition key (the region and the object id), plus the date and the
process ID.  Each event in our system will create anywhere
from 1-20 rows, for example, and multiple parts of the system might be
working on the same object ID.  So the CF is constantly being appended
to, but reads are rare.

CREATE TABLE audit (
 id bigint,
 region ascii,
 date timestamp,
 pid int,
 PRIMARY KEY ((id, region), date, pid)
 );


Data is queried on a specific object ID and region.  Optionally, users can
restrict their query to a specific date range, which the above data model
provides.
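
For example, the reads look roughly like this (the values are just placeholders):

SELECT * FROM audit
WHERE id = 12345 AND region = 'us-east'
  AND date >= '2014-05-01' AND date < '2014-06-01';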

However, we generate quite a bit of data, and we want a convenient way to
get rid of the oldest data.  Since our system scales with the time of year,
we might get 50GB a day during peak, and 5GB of data off peak.  We could
pick the safest number--let's say, 30 days--and set the TTL using that.
The problem there is that, for 90% of the year, we'll be using only a very
small percentage of our available space.

What I'd like to be able to do is drop old tables as needed--i.e., let's
say when we hit 80% load across the cluster (or some such metric that takes
the cluster-wide load into account), I want to drop the oldest day's
records until we're under 80%.  That way, we're always using the maximum
amount of space we can, without having to worry about getting to the point
where we run out of space cluster-wide.

My thoughts are--we could always make the date part of the primary key, but
then we'd either a) have to query the entire range of dates, or b) we'd
have to force a small date range when querying.  What are the penalties?
Do you have any other suggestions?
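
For concreteness, the kind of change I'm picturing is something like this (purely
a sketch, with a synthetic day bucket folded into the partition key; the names
are illustrative):

CREATE TABLE audit_by_day (
  day text,            -- e.g. '20140610'
  id bigint,
  region ascii,
  date timestamp,
  pid int,
  PRIMARY KEY ((day, id, region), date, pid)
);

Reads would then have to name each day bucket explicitly, e.g.:

SELECT * FROM audit_by_day
WHERE day = '20140610' AND id = 12345 AND region = 'us-east';

which is essentially option (b) above: the client has to enumerate the days in
the range it cares about.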


On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield jlacefi...@datastax.com
wrote:

 Hello,

   Will you please describe the use case and what you are trying to model.
  What are some questions/queries that you would like to serve via
 Cassandra.  This will help the community help you a little better.

 Jonathan Lacefield
 Solutions Architect, DataStax
 (404) 822 3487
  http://www.linkedin.com/in/jlacefield

 http://www.datastax.com/cassandrasummit14



 On Mon, Jun 9, 2014 at 7:51 PM, Redmumba redmu...@gmail.com wrote:

 I've been trying to work around using date-based tables because I'd
 like to avoid the overhead.  It seems, however, that this is just not going
 to work.

 So here's a question--for these date-based tables (i.e., a table per
 day/week/month/whatever), how are they queried?  If I keep 60 days worth of
 auditing data, for example, I'd need to query all 60 tables--can I do that
 smoothly?  Or do I have to have 60 different select statements?  Is there a
 way for me to run the same query against all the tables?


 On Mon, Jun 9, 2014 at 3:42 PM, Redmumba redmu...@gmail.com wrote:

 Ah, so the secondary indices are really secondary against the primary
 key.  That makes sense.

 I'm beginning to see why the whole date-based table approach is the
 only one I've been able to find... thanks for the quick responses, guys!


 On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski 
 michal.michal...@boxever.com wrote:

Secondary indexes internally are just CFs that map the indexed value to
a row key which that value belongs to, so you can only query these indexes
using =, not <, >= etc.
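
For example, with a hypothetical secondary index on a 'status' column:

SELECT * FROM users WHERE status = 'active';   -- the index can serve this
SELECT * FROM users WHERE status > 'active';   -- it cannot serve this (only = lookups)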

However, your query does not require an index *IF* you provide a row key -
you can use < or > like you did for the date column, as long as you
refer to a single row. However, if you don't provide it, it's not going to
 

Re: How to restart bootstrap after a failed streaming due to Broken Pipe (1.2.16)

2014-06-10 Thread Robert Coli
On Mon, Jun 9, 2014 at 10:43 PM, Colin Kuo colinkuo...@gmail.com wrote:

 You can use nodetool repair instead. Repair is able to re-transmit the
 data which belongs to new node.


Repair is not very likely to work in cases where bootstrap doesn't.

@OP : you probably will have to tune your phi detector to be more tolerant
of nodes pausing.
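
(If it helps, the relevant knob is phi_convict_threshold in cassandra.yaml; the
default is 8, and raising it makes the failure detector more tolerant of GC/IO
pauses, e.g.

phi_convict_threshold: 12

Treat the exact value as something to experiment with.)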

https://issues.apache.org/jira/browse/CASSANDRA-7063

(etc.)

=Rob


Adding and removing node procedures

2014-06-10 Thread ng
I just wanted to verify the procedures to add and remove nodes in my
environment; please feel free to comment or advise.



I have a 3-node cluster N1, N2, N3 with vnodes configured (256 tokens) on each
node. All are in one data center.

1. Procedure to change node hardware, i.e. replace nodes with new machines
(N1, N2 and N3) to (N11, N21 and N31)


nodetool -h node2 decommission
Bootstrap N21
nodetool repair
nodetool -h node1 decommission
Bootstrap N11
nodetool repair
nodetool -h node3 decommission
Bootstrap N31
nodetool repair

---
2. Procedure for changing 3 nodes cluster to 2 nodes cluster
(N1, N2 and N3)  to (N1, N3)

nodetool -h node2 decommission
Physically get rid of Node2
---
3. Procedure for adding new node

(N1, N2 and N3)  to (N1, N2, N3, N4)
Bootstrap N4
nodetool repair

---
4. Procedure to remove dead node/crashed node.
(node n2 unable to start)
(n1,n2, n3) to (n1,n3)

Shutdown N2 if possible
nodetool removenode xx_hostid_Of_N2_xx
nodetool repair

---
5. Procedure to remove dead node/crashed node and replace with N21.
(node n2 unable to start)
(n1,n2, n3) to (n1,n3, n21)

Shutdown N2 if possible
nodetool removenode xx_hostid_Of_N2_xx
Bootstrap N21
nodetool repair
---
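
(For procedure 5, I have also read that newer releases let you start the
replacement node pointing at the dead node's address instead of doing
removenode + bootstrap, e.g. something like

cassandra -Dcassandra.replace_address=<IP_of_N2>

Is that a better option, or is removenode + bootstrap preferred?)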


Thanks in advance for pointing out any mistakes or offering advice.


StreamException while adding nodes

2014-06-10 Thread Philipp Potisk
Hi,

I tried to double the size of an existing cluster from 4 to 8 nodes. First
I added one node, which joined successfully after 120 min. During that time
there was no additional load on the cluster. Afterwards I started the other
3 new nodes one after another, so they were joining the cluster simultaneously.
Furthermore I put some write load on the cluster. After 45 min of this,
2 of the joining nodes died with the following exception.

Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
    at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)

Since I have restarted Cassandra on the failing nodes (8 hours ago), the 3
nodes remain in status JOINING, but there is no data exchange going on any
more.

Furthermore, nodetool info throws the exception:

Exception in thread "main" java.lang.AssertionError
    at org.apache.cassandra.locator.TokenMetadata.getTokens(TokenMetadata.java:502)
    at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2132)

which corresponds to isMember returning FALSE.

public Collection<Token> getTokens(InetAddress endpoint)
{
    assert endpoint != null;
    assert isMember(endpoint);


My questions right now are:
- What could have caused the streaming error?
- Shouldn't nodes be added while there is some load on the cluster? OS load
was between 2 and 6 on a dual core machine.
- Would it have been better to add the 3 new nodes one by one, rather than
simultaneously?
- How should I proceed with the 3 half joined nodes as they are not willing
to exchange the missing data?

We are using Cassandra 2.0.7 (vnodes and broadly the default config) and
RF 2, with each node having roughly 17 GB of data on it.

Thanks for any hints,
Phil


Re: Cannot query secondary index

2014-06-10 Thread Paulo Ricardo Motta Gomes
Our approach for this scenario is to run a hadoop job that periodically
cleans old entries, but I admit it's far from ideal. Would be nice to have
a more native way to perform these kinds of tasks.

There's a legend about a compaction strategy that keeps only the first N
entries of a partition key. I don't think it has been implemented yet, but
if I remember correctly there's a JIRA ticket about it.


On Tue, Jun 10, 2014 at 3:39 PM, Redmumba redmu...@gmail.com wrote:

 Honestly, this has been by far my single biggest obstacle with Cassandra
 for time-based data--cleaning up the old data when the deletion criteria
 (i.e., date) isn't the primary key.  I've asked about a few different
 approaches, but I haven't really seen any feasible options that can be
 implemented easily.  I've seen the following:

1. Use date-based tables, then drop old tables, ala
audit_table_20140610, audit_table_20140609, etc..
But then I run into the issue of having to query every table--I would
have to execute queries against every day to get the data, and then merge
the data myself.  Unless there's something in the binary driver I'm
missing, it doesn't sound like this would be practical.
2. Use a TTL
But then I have to basically decide on a value that works for
everything and, if it ever turns out I overestimated, I'm basically SOL,
because my cluster will be out of space.
3. Maintain a separate index of days to keys, and use this index as
the reference for which keys to delete.
But then this requires maintaining another index and a relatively
manual delete.

 I can't help but feel that I am just way over-engineering this, or that
 I'm missing something basic in my data model.  Except for the last
 approach, I can't help but feel that I'm overlooking something obvious.

 Andrew


 Of course, Jonathan, I'll do my best!

 It's an auditing table that, right now, uses a primary key consisting of a
 composite partition key (the region and the object id), plus the date and the
 process ID.  Each event in our system will create anywhere
 from 1-20 rows, for example, and multiple parts of the system might be
 working on the same object ID.  So the CF is constantly being appended
 to, but reads are rare.

 CREATE TABLE audit (
 id bigint,
 region ascii,
 date timestamp,
 pid int,
 PRIMARY KEY ((id, region), date, pid)
 );


 Data is queried on a specific object ID and region.  Optionally, users can
 restrict their query to a specific date range, which the above data model
 provides.

 However, we generate quite a bit of data, and we want a convenient way to
 get rid of the oldest data.  Since our system scales with the time of year,
 we might get 50GB a day during peak, and 5GB of data off peak.  We could
 pick the safest number--let's say, 30 days--and set the TTL using that.
 The problem there is that, for 90% of the year, we'll be using only a very
 small percentage of our available space.

 What I'd like to be able to do is drop old tables as needed--i.e., let's
 say when we hit 80% load across the cluster (or some such metric that takes
 the cluster-wide load into account), I want to drop the oldest day's
 records until we're under 80%.  That way, we're always using the maximum
 amount of space we can, without having to worry about getting to the point
 where we run out of space cluster-wide.

 My thoughts are--we could always make the date part of the primary key,
 but then we'd either a) have to query the entire range of dates, or b) we'd
 have to force a small date range when querying.  What are the penalties?
 Do you have any other suggestions?


 On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield 
 jlacefi...@datastax.com wrote:

 Hello,

   Will you please describe the use case and what you are trying to model.
  What are some questions/queries that you would like to serve via
 Cassandra.  This will help the community help you a little better.

 Jonathan Lacefield
 Solutions Architect, DataStax
 (404) 822 3487
  http://www.linkedin.com/in/jlacefield

 http://www.datastax.com/cassandrasummit14



 On Mon, Jun 9, 2014 at 7:51 PM, Redmumba redmu...@gmail.com wrote:

 I've been trying to work around using date-based tables because I'd
 like to avoid the overhead.  It seems, however, that this is just not going
 to work.

 So here's a question--for these date-based tables (i.e., a table per
 day/week/month/whatever), how are they queried?  If I keep 60 days worth of
 auditing data, for example, I'd need to query all 60 tables--can I do that
 smoothly?  Or do I have to have 60 different select statements?  Is there a
 way for me to run the same query against all the tables?


 On Mon, Jun 9, 2014 at 3:42 PM, Redmumba redmu...@gmail.com wrote:

 Ah, so the secondary indices are really secondary against the primary
 key.  That makes sense.

 I'm beginning to see why the whole date-based table approach is the
 only one I've been able to 

Large number of row keys in query kills cluster

2014-06-10 Thread Jeremy Jongsma
I ran an application today that attempted to fetch 20,000+ unique row keys
in one query against a set of completely empty column families. On a 4-node
cluster (EC2 m1.large instances) with the recommended memory settings (2 GB
heap), every single node immediately ran out of memory and became
unresponsive, to the point where I had to kill -9 the cassandra processes.

Now clearly this query is not the best idea in the world, but the effects
of it are a bit disturbing. What could be going on here? Are there any
other query pitfalls I should be aware of that have the potential to
explode the entire cluster?

-j


Re: Consolidating records and TTL

2014-06-10 Thread Tyler Hobbs
On Thu, Jun 5, 2014 at 2:38 PM, Charlie Mason charlie@gmail.com wrote:


 I can't do the initial account insert with a TTL as I can't guarantee when
 a new value would come along and so replace this account record. However
 when I insert the new account record, instead of deleting the old one could
 I reinsert it with a TTL of say 1 month.

 How would compaction handle this. Would the original record get compacted
 away after 1 month + the GC Grace period or would it hang around still?


Yes, after 1 month + gc_grace, it will be eligible for removal during
compaction.  Of course, a compaction on that sstable still has to take
place before it can be removed.  If you're using
SizeTieredCompactionStrategy (the default) and have a lot of data, that may
take a few more days.
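
For example, the re-insert Charlie described would look something like this
(table and column names made up):

INSERT INTO accounts (account_id, account_data)
VALUES ('abc123', 'superseded value')
USING TTL 2592000;   -- 2592000 seconds = 30 days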


-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Large number of row keys in query kills cluster

2014-06-10 Thread DuyHai Doan
Hello Jeremy

Basically what you are doing is to ask Cassandra to do a distributed full
scan on all the partitions across the cluster, it's normal that the nodes
are somehow stressed.

How did you make the query? Are you using Thrift or CQL3 API?

Please note that there is another way to get all partition keys : SELECT
DISTINCT partition_key FROM..., more details here :
www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
I ran an application today that attempted to fetch 20,000+ unique row keys
in one query against a set of completely empty column families. On a 4-node
cluster (EC2 m1.large instances) with the recommended memory settings (2 GB
heap), every single node immediately ran out of memory and became
unresponsive, to the point where I had to kill -9 the cassandra processes.

Now clearly this query is not the best idea in the world, but the effects
of it are a bit disturbing. What could be going on here? Are there any
other query pitfalls I should be aware of that have the potential to
explode the entire cluster?

-j


Re: Large number of row keys in query kills cluster

2014-06-10 Thread Jeremy Jongsma
I didn't explain clearly - I'm not requesting 20,000 unknown keys (resulting
in a full scan), I'm requesting 20,000 specific rows by key.
On Jun 10, 2014 6:02 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Jeremy

 Basically what you are doing is to ask Cassandra to do a distributed full
 scan on all the partitions across the cluster, it's normal that the nodes
 are somehow stressed.

 How did you make the query? Are you using Thrift or CQL3 API?

 Please note that there is another way to get all partition keys : SELECT
 DISTINCT partition_key FROM..., more details here :
 www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
 I ran an application today that attempted to fetch 20,000+ unique row keys
 in one query against a set of completely empty column families. On a 4-node
 cluster (EC2 m1.large instances) with the recommended memory settings (2 GB
 heap), every single node immediately ran out of memory and became
 unresponsive, to the point where I had to kill -9 the cassandra processes.

 Now clearly this query is not the best idea in the world, but the effects
 of it are a bit disturbing. What could be going on here? Are there any
 other query pitfalls I should be aware of that have the potential to
 explode the entire cluster?

 -j



Re: Large number of row keys in query kills cluster

2014-06-10 Thread Laing, Michael
Perhaps if you described both the schema and the query in more detail, we
could help... e.g. did the query have an IN clause with 20,000 keys? Or is
the key compound? More detail will help.
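
For example, a single statement along these lines (table and column names made
up):

SELECT * FROM my_cf WHERE row_key IN (1, 2, 3);   -- imagine ~20,000 values in the IN list

behaves very differently from issuing many small queries, because one
coordinator has to fan the whole thing out and collect all of the results.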


On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma jer...@barchart.com wrote:

 I didn't explain clearly - I'm not requesting 20,000 unknown keys
 (resulting in a full scan), I'm requesting 20,000 specific rows by key.
 On Jun 10, 2014 6:02 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Jeremy

 Basically what you are doing is to ask Cassandra to do a distributed full
 scan on all the partitions across the cluster, it's normal that the nodes
 are somehow stressed.

 How did you make the query? Are you using Thrift or CQL3 API?

 Please note that there is another way to get all partition keys : SELECT
 DISTINCT partition_key FROM..., more details here :
 www.datastax.com/dev/blog/cassandra-2-0-1-2-0-2-and-a-quick-peek-at-2-0-3
 I ran an application today that attempted to fetch 20,000+ unique row
 keys in one query against a set of completely empty column families. On a
 4-node cluster (EC2 m1.large instances) with the recommended memory
 settings (2 GB heap), every single node immediately ran out of memory and
 became unresponsive, to the point where I had to kill -9 the cassandra
 processes.

 Now clearly this query is not the best idea in the world, but the effects
 of it are a bit disturbing. What could be going on here? Are there any
 other query pitfalls I should be aware of that have the potential to
 explode the entire cluster?

 -j




Re: StreamException while adding nodes

2014-06-10 Thread Robert Coli
On Tue, Jun 10, 2014 at 2:21 PM, Philipp Potisk philipp.pot...@geroba.at
wrote:

 First I added one node, which joined after 120min successfully. During
 that time there was no additional load on the cluster. Afterwards I started
 the other 3 new nodes after each other in order to join the cluster
 simultaneously.


Bootstrapping multiple nodes at once is now and has always been Not
Supported, but is such a common thing for new operators to try that there
is now a goal to prevent them from doing it [1].

Cancel those simultaneous bootstraps and do them one at a time, and they'll
probably work.

[1] https://issues.apache.org/jira/browse/CASSANDRA-7069

=Rob


Re: VPC AWS

2014-06-10 Thread Ben Bromhead
Have a look at http://www.tinc-vpn.org/, mesh based and handles multiple 
gateways for the same network in a graceful manner (so you can run two gateways 
per region for HA).

Also supports NAT traversal if you need to do public-private clusters. 

We are currently evaluating it for our managed Cassandra in a VPC solution, but 
we haven’t ever used it in a production environment or with a heavy load, so 
caveat emptor. 

As for the snitch… the GPFS (GossipingPropertyFileSnitch) is definitely the most flexible. 
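
For reference, a minimal sketch of what it needs (the DC/rack names are just
examples):

# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties, set per node
dc=us-east-vpc
rack=rack1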

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359

On 10 Jun 2014, at 1:42 am, Ackerman, Mitchell mitchell.acker...@pgi.com 
wrote:

 Peter,
  
 I too am working on setting up a multi-region VPC Cassandra cluster.  Each 
 region is connected to each other via an OpenVPN tunnel, so we can use 
 internal IP addresses for both the seeds and broadcast address.   This allows 
 us to use the EC2Snitch (my interpretation of the caveat that this snitch 
 won’t work in a multi-region environment is that it won’t work if you can’t 
 use internal IP addresses, which we can via the VPN tunnels).  All the C* 
 nodes find each other, and nodetool (or OpsCenter) shows that we have 
 established a multi-datacenter cluster. 
  
 Thus far, I’m not happy with the performance of the cluster in such a 
 configuration, but I don’t think that it is related to this configuration, 
 though it could be.
  
 Mitchell
  
 From: Peter Sanford [mailto:psanf...@retailnext.net] 
 Sent: Monday, June 09, 2014 7:19 AM
 To: user@cassandra.apache.org
 Subject: Re: VPC AWS
  
 Your general assessments of the limitations of the Ec2 snitches seem to match 
 what we've found. We're currently using the GossipingPropertyFileSnitch in 
 our VPCs. This is also the snitch to use if you ever want to have a DC in EC2 
 and a DC with another hosting provider. 
  
 -Peter
  
 
 On Mon, Jun 9, 2014 at 5:48 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 Hi guys, there are a lot of answers; it looks like this subject interests 
 a lot of people, so I will end up letting you know how it went for us.
  
 For now, we are still doing some tests.
  
 Yet I would like to know how we are supposed to configure Cassandra in this 
 environment :
  
 - VPC 
 - Multiple datacenters (should be VPCs, one per region, linked through VPN ?)
 - Cassandra 1.2
  
 We are currently running under EC2MultiRegionSnitch, but with no VPC. Our VPC 
 will have no public interface, so I am not sure how to configure broadcast 
 address or seeds that are supposed to be the public IP of the node.
  
 I could use EC2Snitch, but will cross-region work properly?
  
 Should I use another snitch?
  
 Is someone using a similar configuration ?
  
 Thanks for information already given guys, we will achieve this ;-).
  
 
 2014-06-07 0:05 GMT+02:00 Jonathan Haddad j...@jonhaddad.com:
  
 This may not help you with the migration, but it may with maintenance and 
 management.  I just put up a blog post on managing VPC security groups with a 
 tool I open sourced at my previous company.  If you're going to have 
 different VPCs (staging / prod), it might help with managing security groups.
  
 http://rustyrazorblade.com/2014/06/an-introduction-to-roadhouse/
  
 Semi shameless plug... but relevant.
  
 
 On Thu, Jun 5, 2014 at 12:01 PM, Aiman Parvaiz ai...@shift.com wrote:
 Cool, thanks again for this.
  
 
 On Thu, Jun 5, 2014 at 11:51 AM, Michael Theroux mthero...@yahoo.com wrote:
 You can have a ring spread across EC2 and the public subnet of a VPC.  That 
 is how we did our migration.  In our case, we simply replaced the existing 
 EC2 node with a new instance in the public VPC, restored from a backup taken 
 right before the switch.
  
 -Mike
  
 From: Aiman Parvaiz ai...@shift.com
 To: Michael Theroux mthero...@yahoo.com 
 Cc: user@cassandra.apache.org user@cassandra.apache.org 
 Sent: Thursday, June 5, 2014 2:39 PM
 Subject: Re: VPC AWS
  
 Thanks for this info Michael. As far as restoring nodes in the public VPC is 
 concerned, I was thinking (and I might be wrong here) that if we can have a ring 
 spread across EC2 and the public subnet of a VPC, I can simply decommission 
 nodes in EC2 as I gradually introduce new nodes in the public subnet of the VPC. 
 I would end up with a ring in the public subnet, and could then migrate the 
 nodes from public to private in a similar way, maybe.
  
 If anyone has any experience/ suggestions with this please share, would 
 really appreciate it.
  
 Aiman
  
 
 On Thu, Jun 5, 2014 at 10:37 AM, Michael Theroux mthero...@yahoo.com wrote:
 The implementation of moving from EC2 to a VPC was a bit of a juggling act.  
 Our motivation was twofold:
  
 1) We were running out of static IP addresses, and it was becoming 
 increasingly difficult in EC2 to design around limiting the number of static 
 IP addresses to the number of public IP addresses EC2 allowed
 2) VPC affords us an additional level of security that was desirable.
  
 However, we needed to consider the following