Large size KS management

2018-04-19 Thread Aiman Parvaiz
Hi all

I have been given a 15-node C* 2.2.8 cluster to manage which has a large 
KS (~800GB). Given the size of the KS, most management tasks like repair 
take a long time to complete, and disk space management is becoming tricky from 
the systems perspective.


This KS size is going to grow in the future and we have a business requirement of 
long data retention here. I wanted to share this with all of you and ask what 
my options are here, and what would be the best way to deal with a large KS 
like this one. To make the situation even trickier, low I/O latency is expected from 
this cluster as well.


Thankful for any suggestions/advice in advance.




Re: Need help with incremental repair

2017-10-29 Thread Aiman Parvaiz
Thanks Blake and Paulo for the response.

Yes, the idea is to go back to non-incremental repairs. I am waiting for all 
the "anticompaction after repair" activities to complete and, in my 
understanding (thanks to Blake for the explanation), I can run a full repair 
on that KS and then get back to my non-incremental repair regimen.


I assume that I should mark the SSTables as unrepaired first and then run a full 
repair?

Also, although I am installing Cassandra from the dsc22 package on my CentOS 7, I 
couldn't find the sstable tools installed; I need to figure that out too.


From: Paulo Motta <pauloricard...@gmail.com>
Sent: Sunday, October 29, 2017 1:56:38 PM
To: user@cassandra.apache.org
Subject: Re: Need help with incremental repair

> Assuming the situation is just "we accidentally ran incremental repair", you 
> shouldn't have to do anything. It's not going to hurt anything

Once you run incremental repair, your data is permanently marked as
repaired, and is no longer compacted with new non-incrementally
repaired data. This can cause read fragmentation and prevent deleted
data from being purged. If you ever run incremental repair and want to
switch to non-incremental repair, you should manually mark your
repaired SSTables as not-repaired with the sstablerepairedset tool.
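
For reference, a rough sketch of that switch on a 2.2 node; paths and the
keyspace name "my_ks" are placeholders, and your package layout may differ:

    # stop the node cleanly before touching SSTables on disk
    nodetool drain && sudo service cassandra stop
    # mark every live SSTable of the keyspace as unrepaired
    find /var/lib/cassandra/data/my_ks -name '*-Data.db' -not -path '*snapshots*' -print0 \
      | xargs -0 sstablerepairedset --really-set --is-unrepaired
    sudo service cassandra start
    # afterwards, go back to full (non-incremental) repairs
    nodetool repair -full my_ks

sstablemetadata can be used to verify the change ("Repaired at: 0" means unrepaired).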

2017-10-29 3:05 GMT+11:00 Blake Eggleston <beggles...@apple.com>:
> Hey Aiman,
>
> Assuming the situation is just "we accidentally ran incremental repair", you
> shouldn't have to do anything. It's not going to hurt anything. Pre-4.0
> incremental repair has some issues that can cause a lot of extra streaming,
> and inconsistencies in some edge cases, but as long as you're running full
> repairs before gc grace expires, everything should be ok.
>
> Thanks,
>
> Blake
>
>
> On October 28, 2017 at 1:28:42 AM, Aiman Parvaiz (ai...@steelhouse.com)
> wrote:
>
> Hi everyone,
>
> We seek your help with an issue we are facing on our 2.2.8 cluster.
>
> We have a 24-node cluster spread over 3 DCs.
>
> Initially, when the cluster was in a single DC, we were using The Last Pickle
> Reaper 0.5 to repair it with incremental repair set to false. We added 2
> more DCs. Now the problem is that, accidentally, on one of the newer DCs we
> ran nodetool repair without realizing that for 2.2 the default
> option is incremental.
>
> I am not seeing any errors in the logs so far but wanted to know what
> would be the best way to handle this situation. To make things a little more
> complicated, the node on which we triggered this repair is almost out of
> disk space and we had to restart C* on it.
>
> I can see a bunch of "anticompaction after repair" entries under OpsCenter Activities
> across various nodes in the 3 DCs.
>
>
> Any help, suggestion would be appreciated.
>
> Thanks
>
>




Need help with incremental repair

2017-10-28 Thread Aiman Parvaiz
Hi everyone,

We seek your help with an issue we are facing on our 2.2.8 cluster.

We have a 24-node cluster spread over 3 DCs.

Initially, when the cluster was in a single DC, we were using The Last Pickle 
Reaper 0.5 to repair it with incremental repair set to false. We added 2 more 
DCs. Now the problem is that, accidentally, on one of the newer DCs we ran 
nodetool repair without realizing that for 2.2 the default option is 
incremental.

I am not seeing any errors in the logs so far but wanted to know what would 
be the best way to handle this situation. To make things a little more 
complicated, the node on which we triggered this repair is almost out of disk 
space and we had to restart C* on it.

I can see a bunch of "anticompaction after repair" entries under OpsCenter Activities 
across various nodes in the 3 DCs.


Any help, suggestion would be appreciated.

Thanks



Re: Reaper 0.7 is released!

2017-09-27 Thread Aiman Parvaiz
Thanks!! Love Reaper :)

Sent from my iPhone

On Sep 27, 2017, at 10:01 AM, Jon Haddad wrote:

Hey folks,

We (The Last Pickle) are proud to announce the release of Reaper 0.7!  In this 
release we've added support to run Reaper across multiple data centers as well 
as supporting Reaper failover when using the Cassandra storage backend.

You can grab DEB, RPM and tarballs off the downloads page: 
http://cassandra-reaper.io/docs/download/

We've made significant improvements to the docs section of the site as well, 
with a slew of other improvements in the works.

The Reaper user mailing list is located here, for questions and feedback: 
https://groups.google.com/forum/#!forum/tlp-apache-cassandra-reaper-users

Thanks,
Jon


Re: Reaper v0.6.1 released

2017-06-15 Thread Aiman Parvaiz
Great work!! Thanks

Sent from my iPhone

On Jun 14, 2017, at 11:30 PM, Shalom Sagges wrote:

That's awesome!! Thanks for contributing! 


Shalom Sagges
DBA
T: +972-74-700-4035
We Create Meaningful Connections





On Thu, Jun 15, 2017 at 2:32 AM, Jonathan Haddad wrote:
Hey folks!

I'm proud to announce the 0.6.1 release of the Reaper project, the open source 
repair management tool for Apache Cassandra.

This release improves the Cassandra backend significantly, making it a first 
class citizen for storing repair schedules and managing repair progress.  It's 
no longer necessary to manage a PostgreSQL DB in addition to your Cassandra DB.

We've been very active since we forked the original Spotify repo.  Since this 
time we've added:

* A native Cassandra backend
* Support for versions > 2.0
* Merged in the WebUI, maintained by Stefan Podkowinski 
(https://github.com/spodkowinski/cassandra-reaper-ui)
* Support for incremental repair (probably best to avoid till Cassandra 4.0, 
see CASSANDRA-9143)

We're excited to continue making improvements past the original intent of the 
project.  With the lack of Cassandra 3.0 support in OpsCenter, there's a gap 
that needs to be filled for tools that help with managing a cluster.  Alex 
Dejanovski showed me a prototype he recently put together for a really nice 
view into cluster health.  We're also looking to add in support for common cluster 
operations like snapshots, upgradesstables, cleanup, and setting options at 
runtime.

Grab it here: https://github.com/thelastpickle/cassandra-reaper

Feedback / bug reports / ideas are very much appreciated.

We have a dedicated, low traffic ML here: 
https://groups.google.com/forum/#!forum/tlp-apache-cassandra-reaper-users

Jon Haddad
Principal Consultant, The Last Pickle
http://thelastpickle.com/




Re: Advice in upgrade plan from 1.2.18 to 2.2.8

2016-12-22 Thread Aiman Parvaiz
Thanks Alain. This was extremely helpful, really grateful.

Aiman
On Dec 22, 2016, at 5:00 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

Hi,

Here are some thoughts:

running 1.2.18. I plan to upgrade them to 2.2.latest

Going one major release at a time is probably the safest way to go indeed.


  1.  Install 2.0.latest on one node at a time, start and wait for it to join 
the ring.
  2.  Run upgradesstables on this node.
  3.  Repeat steps 1 and 2 on each node, installing Cassandra 2.0 in a rolling manner 
and running upgradesstables in parallel. (Please let me know if running 
upgradesstables in parallel is not right here. My cluster is not under much 
load really.)

I would:

- Upgrade one node, check for cluster health (monitoring, logs, nodetool 
commands), paying special attention to the 2.0 node.
- If everything is ok, then go for more nodes; if using distinct racks I would 
go per rack: sequentially, node by node, all the nodes from DC1-rack1, then 
DC1-rack2, then DC1-rack3. Then move to the next DC if everything is fine.
- Start the 'upgradesstables' when the cluster is completely and successfully 
running with the new version (2.0.17). It is perfectly fine to run this in 
parallel as the last part of the upgrade. As you guessed, it is good to keep 
monitoring the cluster load.
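
As a rough per-node sketch of the above (package and service names are 
assumptions, adjust to your install):

    nodetool drain                        # flush memtables and stop accepting writes
    sudo service cassandra stop
    sudo yum install cassandra20          # hypothetical package name for the 2.0.x target
    sudo service cassandra start
    nodetool version && nodetool status   # node should come back UN on the new version
    tail -n 200 /var/log/cassandra/system.log   # check for errors before moving on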

4. Now I will have both my DCs running 2.0.latest.

Without really having any strong argument, I would let it run for "some time" 
like this, hours at least, maybe days. In any case, you will probably have some 
work to prepare before the next upgrade, so you will have time to check how the 
cluster is doing.

6. Do I need to run upgradesstables here again after the node has started and 
joined? (I think yes, but seek advice. 
https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandra.html)

Yes, every time you run a major upgrade. Anyway, nodetool upgradesstables will 
skip any sstables that do not need to be upgraded (as long as you don't add the 
option to force it), so it is probably better to run it when you have a doubt.
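
For example (keyspace and table names are placeholders):

    nodetool upgradesstables                      # rewrites only SSTables on an older format, skips the rest
    nodetool upgradesstables -a my_ks my_table    # -a forces rewriting every SSTable, scoped here to one table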


As additional information, I would prepare, for each upgrade:


  *   The new Cassandra configuration (cassandra.yaml and cassandra-env.sh 
mainly, but also other configuration files)

To do that, I usually merge the current file in use (your configuration on C* 
1.2.18) with the stock file from GitHub for the new version (i.e. 
https://github.com/apache/cassandra/tree/cassandra-2.0.17/conf); see the sketch 
after this list.

This allows you to
 *   Acknowledge and consider the new and removed configuration settings
 *   Keep comments and default values in the configuration files up to date
 *   Be fully exhaustive, and learn as you parse the files

  *   Make sure clients will still work with the new version (see the docs, do 
the tests)
  *   Cassandra metrics changed in the latest versions, so you might have to 
rework your dashboards. Anticipating the dashboard creation for new versions 
would prevent you from losing metrics when you need them the most.
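
A minimal sketch of that configuration merge, assuming the cassandra-2.0.17 tag 
and illustrative paths:

    # fetch the stock config for the target version
    curl -sLO https://raw.githubusercontent.com/apache/cassandra/cassandra-2.0.17/conf/cassandra.yaml
    # compare against what is currently deployed and note every deliberate override
    diff -u cassandra.yaml /etc/cassandra/conf/cassandra.yaml | less
    # then start from the new stock file and re-apply your overrides
    # (cluster_name, seeds, listen_address, endpoint_snitch, ...) one by one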

Finally, keep in mind that you should not perform any streaming while running 
multiple versions and as long as 'nodetool upgradesstables' is not completely 
done. Meaning you should not add, remove, replace, move or repair a node. Also, 
I would limit schema changes as much as possible while running multiple 
versions, as it has caused trouble in the past.

During an upgrade, almost nothing else than the normal load due to the service 
and the upgrade itself should happen. We always try to keep this time window as 
short as possible.

C*heers,
---
Alain Rodriguez - @arodream - 
al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-12-21 20:36 GMT+01:00 Aiman Parvaiz <ai...@steelhouse.com>:
Hi everyone,
I have 2 C* DCs with 12 nodes in each running 1.2.18. I plan to upgrade them to 
2.2.latest and wanted to run my plan by you experts.


  1.  Install 2.0.latest on one node at a time, start and wait for it to join 
the ring.
  2.  Run upgradesstables on this node.
  3.  Repeat steps 1 and 2 on each node, installing Cassandra 2.0 in a rolling manner 
and running upgradesstables in parallel. (Please let me know if running 
upgradesstables in parallel is not right here. My cluster is not under much 
load really.)
  4.  Now I will have both my DCs running 2.0.latest.
  5.  Install Cassandra 2.1.latest on one node at a time (same as above)
  6.  Do I need to run upgradesstables here again after the node has started 
and joined? (I think yes, but seek advice. 
https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandra.html)
  7.  Following the above pattern, I would install Cassandra 2.1 in a rolling 
manner across 2 DCs (depending on the response to 6 I might or might not run 
upgradesstables)
  8.  At this point both DCs would have 2.1.latest, and again in a rolling manner 
I install 2.2.8.

Advice in upgrade plan from 1.2.18 to 2.2.8

2016-12-21 Thread Aiman Parvaiz
Hi everyone,
I have 2 C* DCs with 12 nodes in each running 1.2.18. I plan to upgrade them to 
2.2.latest and wanted to run my plan by you experts.


  1.  Install 2.0.latest on one node at a time, start and wait for it to join 
the ring.
  2.  Run upgradesstables on this node.
  3.  Repeat steps 1 and 2 on each node, installing Cassandra 2.0 in a rolling manner 
and running upgradesstables in parallel. (Please let me know if running 
upgradesstables in parallel is not right here. My cluster is not under much 
load really.)
  4.  Now I will have both my DCs running 2.0.latest.
  5.  Install Cassandra 2.1.latest on one node at a time (same as above)
  6.  Do I need to run upgradesstables here again after the node has started 
and joined? (I think yes, but seek advice. 
https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdCassandra.html)
  7.  Following the above pattern, I would install Cassandra 2.1 in a rolling 
manner across 2 DCs (depending on the response to 6 I might or might not run 
upgradesstables)
  8.  At this point both DCs would have 2.1.latest, and again in a rolling manner 
I install 2.2.8.

My assumption is that while this upgrade is happening, C* will still be able to 
serve reads and writes, and that running different versions at various points 
in the upgrade process will not affect the apps reading from/writing to C*.

Thanks



Bootstrapping multiple C* nodes in AWS

2016-08-30 Thread Aiman Parvaiz
Hi all
I am running C* 2.1.12 in AWS EC2 Classic with RF=3 and vnodes (256
tokens/node). My nodes are distributed in three different availability
zones. I want to scale up the cluster size; given the data size per node, it
takes around 24 hours to add one node.

I wanted to know if it's safe to add multiple nodes at once in AWS and
whether I should add them in the same availability zone. I would be grateful to
hear your experiences here.

Thanks


Re: Cassandra and Kubernetes and scaling

2016-05-24 Thread Aiman Parvaiz
Looking forward to hearing from the community about this.

Sent from my iPhone

> On May 24, 2016, at 10:19 AM, Mike Wojcikiewicz  wrote:
> 
> I saw a thread from April 2016 talking about Cassandra and Kubernetes, and 
> have a few follow up questions.  It seems that especially after v1.2 of 
> Kubernetes, and the upcoming 1.3 features, this would be a very viable option 
> of running Cassandra on.
> 
> My questions pertain to HostIds and Scaling Up/Down, and are related:
> 
> 1.  If a container's host dies and is then brought up on another host, can 
> you start up with the same PersistentVolume as the original container had?  
> Which begs the question would the new container get a new HostId, implying it 
> would need to bootstrap into the environment?   If it's a bootstrap, does the 
> old one get deco'd/assassinated?
> 
> 2. Scaling up/down.  Scaling up would be relatively easy, as it should just 
> kick off Bootstrapping the node into the cluster, but what if you need to 
> scale down?  Would the Container get deco'd by the scaling down process? or 
> just terminated, leaving you with potential missing replicas
> 
> 3. Scaling up and increasing the RF of a particular keyspace, would there be 
> a clean way to do this with the kubernetes tooling? 
> 
> In the end I'm wondering how much of the Kubernetes + Cassandra involves 
> nodetool, and how much is just a Docker image where you need to manage that 
> all yourself (painfully)
> 
> -- 
> --mike


Re: Cassandra 2.1.12 Node size

2016-04-14 Thread Aiman Parvaiz
Right now the biggest SSTable I have is 210 GB on a 3 TB disk, and total disk
consumed is around 50% on all nodes; I am using STCS. Read and write query
latency is under 15 ms. Full repair time is long, but I am sure that when I
switch to incremental repairs this will be taken care of. I am hitting the 50%
disk issue. I recently ran cleanup, and backups aren't taking that much
space.
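
For anyone following along, a rough way to sanity-check those numbers; paths 
are illustrative and <ks>/<table> are placeholders:

    df -h /var/lib/cassandra                                        # overall headroom
    du -sh /var/lib/cassandra/data/*/*/snapshots 2>/dev/null        # space held by snapshots, if any
    ls -lhS /var/lib/cassandra/data/<ks>/<table>/*-Data.db | head   # biggest SSTables first

nodetool clearsnapshot can reclaim the snapshot space if it is no longer needed.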

On Thu, Apr 14, 2016 at 8:06 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> The four criteria I would suggest for evaluating node size:
>
> 1. Query latency.
> 2. Query throughput/load
> 3. Repair time - worst case, full repair, what you can least afford if it
> happens at the worst time
> 4. Expected growth over the next six to 18 months - you don't want to be
> scrambling with latency, throughput, and repair problems when you bump into
> a wall on capacity. 20% to 30% is a fair number.
>
> Alas, it is very difficult to determine how much spare capacity you have,
> other than an artificial, synthetic load test: Try 30% more clients and
> queries with 30% more (synthetic) data and see what happens to query
> latency, total throughput, and repair time. Run such a test periodically
> (monthly) to get a heads-up when load is getting closer to a wall.
>
> Incremental repair is great to streamline and optimize your day-to-day
> operations, but focus attention on replacement of down nodes during times
> of stress.
>
>
>
> -- Jack Krupansky
>
> On Thu, Apr 14, 2016 at 10:14 AM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Would adding nodes be the right way to start if I want to get the data
>>> per node down
>>
>>
>> Yes, if everything else is fine, the last and always available option to
>> reduce the disk size per node is to add new nodes. Sometimes it is the
>> first option considered as it is relatively quick and quite straightforward.
>>
>> Again, 50 % of free disk space is not a hard limit. To give you a rough
>> idea, if the biggest sstable is 100 GB and you still have 400 GB free,
>> you will probably be good to go, except if 4 compactions of 100 GB trigger
>> at the same time, filling up the disk.
>>
>> Now is the good time to think of a plan to handle the growth for you, but
>> don't worry if data reaches 60%, it will probably not be a big deal.
>>
>> You can make sure that:
>>
>> - There are no snapshots, heap dumps or data not related with C* taking
>> some space
>> - The biggest sstables' tombstone ratios are not too high (are tombstones
>> being evicted correctly?)
>> - You are using compression (if you want to)
>>
>> Consider:
>>
>> - Adding TTLs to data you don't want to keep forever, shorten TTLs as
>> much as allowed.
>> - Migrating to C* 3.0+ and taking advantage of the new storage engine
>>
>> C*heers,
>> ---
>> Alain Rodriguez - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> 2016-04-14 15:41 GMT+02:00 Aiman Parvaiz <ai...@flipagram.com>:
>>
>>> Thanks for the response Alain. I am using STCS and would like to take
>>> some action as we would be hitting 50% disk space pretty soon. Would adding
>>> nodes be the right way to start if I want to get the data per node down?
>>> Otherwise, can you or someone on the list please suggest the right way to go
>>> about it.
>>>
>>> Thanks
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 14, 2016, at 5:17 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>>> I seek advice on data size per node. Each of my nodes has close to 1 TB
>>>> of data. I am not seeing any issues as of now but wanted to run it by you
>>>> guys whether this data size is pushing the limits in any manner and if I should
>>>> be working on reducing data size per node.
>>>
>>>
>>> There is no real limit to the data size other than 50% of the machine
>>> disk space using STCS and 80 % if you are using LCS. Those are 'soft'
>>> limits as it will depend on your biggest sstables size and the number of
>>> concurrent compactions mainly, but to stay away from trouble, it is better
>>> to keep things under control, below the limits mentioned above.
>>>
>>>> I will be migrating to incremental repairs shortly and a full repair as of
>>>> now takes 20 hr/node. I am not seeing any issues with the nodes for now.
>>>>
>>>
>>> As you noticed, you need to keep in mind that the larger the dataset is, 
the longer operations will take.

Re: Cassandra 2.1.12 Node size

2016-04-14 Thread Aiman Parvaiz
Thanks for the response Alain. I am using STCS and would like to take some 
action as we would be hitting 50% disk space pretty soon. Would adding nodes be 
the right way to start if I want to get the data per node down? Otherwise, can 
you or someone on the list please suggest the right way to go about it.

Thanks

Sent from my iPhone

> On Apr 14, 2016, at 5:17 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
> 
> Hi,
> 
>> I seek advice on data size per node. Each of my nodes has close to 1 TB of 
>> data. I am not seeing any issues as of now but wanted to run it by you guys 
>> whether this data size is pushing the limits in any manner and if I should be 
>> working on reducing data size per node.
> 
> There is no real limit to the data size other than 50% of the machine disk 
> space using STCS and 80 % if you are using LCS. Those are 'soft' limits as it 
> will depend on your biggest sstables size and the number of concurrent 
> compactions mainly, but to stay away from trouble, it is better to keep 
> things under control, below the limits mentioned above.
> 
>> I will be migrating to incremental repairs shortly and a full repair as of now 
>> takes 20 hr/node. I am not seeing any issues with the nodes for now.
> 
> As you noticed, you need to keep in mind that the larger the dataset is, the 
> longer operations will take. Repairs but also bootstrap or replace a node, 
> remove a node, any operation that require to stream data or read it. Repair 
> time can be mitigated by using incremental repairs indeed. 
> 
>> I am running a 9 node C* 2.1.12 cluster.
> 
> It should be quite safe to give incremental repair a try as many bugs have 
> been fixed in this version:
> 
> FIX 2.1.12 - A lot of sstables using range repairs due to anticompaction - 
> incremental only
> 
> https://issues.apache.org/jira/browse/CASSANDRA-10422
> 
> FIX 2.1.12 - repair hang when replica is down - incremental only
> 
> https://issues.apache.org/jira/browse/CASSANDRA-10288
> 
> If you are using DTCS be aware of 
> https://issues.apache.org/jira/browse/CASSANDRA-3
> 
> If using LCS, watch closely sstable and compactions pending counts.
> 
> As a general comment, I would say that Cassandra has evolved to be able to 
> handle huge datasets (memory structures off-heap + increase of heap size 
> using G1GC, JBOD, vnodes, ...). Today Cassandra works just fine with big 
> datasets. I have seen clusters with 4+ TB nodes and others using a few GB per 
> node. It all depends on your requirements and your machines' specs. If fast 
> operations are absolutely necessary, keep it small. If you want to use the 
> entire disk space (50/80% of total disk space max), go ahead as long as other 
> resources are fine (CPU, memory, disk throughput, ...).
> 
> C*heers,
> 
> -------
> Alain Rodriguez - al...@thelastpickle.com
> France
> 
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> 
> 2016-04-14 10:57 GMT+02:00 Aiman Parvaiz <ai...@flipagram.com>:
>> Hi all,
>> I am running a 9-node C* 2.1.12 cluster. I seek advice on data size per 
>> node. Each of my nodes has close to 1 TB of data. I am not seeing any issues 
>> as of now but wanted to run it by you guys whether this data size is pushing the 
>> limits in any manner and if I should be working on reducing data size per 
>> node. I will be migrating to incremental repairs shortly and a full repair as 
>> of now takes 20 hr/node. I am not seeing any issues with the nodes for now.
>> 
>> Thanks
> 


Cassandra 2.1.12 Node size

2016-04-14 Thread Aiman Parvaiz
Hi all,
I am running a 9-node C* 2.1.12 cluster. I seek advice on data size per
node. Each of my nodes has close to 1 TB of data. I am not seeing any issues
as of now but wanted to run it by you guys whether this data size is pushing the
limits in any manner and if I should be working on reducing data size per
node. I will be migrating to incremental repairs shortly, and a full repair as
of now takes 20 hr/node. I am not seeing any issues with the nodes for now.

Thanks


Re: Need advice for multi DC C* setup

2015-08-17 Thread Aiman Parvaiz
Over the weekend, and after some more looking around and following this old
mailing list post

https://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201406.mbox/%3cca+vsrlopop7th8nx20aoz3as75g2jrjm3ryx119deklynhq...@mail.gmail.com%3E

I was able to move my 2-node test env over to a 3-node RF=3 cluster in a
private-subnet VPC in region B. I updated the RF when I started replication
between the C* in the public subnet and the private subnet in region B.

The catch here is that since the test env was 2 nodes and both nodes were
in one AZ, changing the snitch to Ec2MultiRegionSnitch didn't affect the replica
placement, and hence I was able to get away with a rolling restart. But in
production I have 10 nodes spread over 2 AZs running SimpleSnitch. I wonder
what would be the best way to change the snitch live in this scenario.

One way, I think, would be to get all nodes into one AZ and then switch the
snitch; that way Ec2MultiRegionSnitch would report all nodes in 1 rack. But I am
open to suggestions, and also to hearing whether this is a valid concern.
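
For what it's worth, a hedged sketch of that switch (keyspace and DC names are 
placeholders; Ec2MultiRegionSnitch reports the EC2 region, e.g. 'us-east', as 
the DC name):

    # on each node, one at a time, then restart that node
    sudo sed -i 's/^endpoint_snitch:.*/endpoint_snitch: Ec2MultiRegionSnitch/' /etc/cassandra/conf/cassandra.yaml
    # once the whole cluster is on the new snitch, move the keyspace to NetworkTopologyStrategy
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};"
    # and repair on every node so replicas end up where the new topology expects them
    nodetool repair my_ks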

Thanks

On Sun, Aug 16, 2015 at 1:46 AM, Prem Yadav ipremya...@gmail.com wrote:

 I meant the existing nodes must be in the default VPC if you did not
 create one,
 In any case, you can use the VPC peering.

 On Sun, Aug 16, 2015 at 5:34 AM, John Wong gokoproj...@gmail.com wrote:

  The EC2 nodes must be in the default VPC.
 Did you really mean the default VPC created by AWS or just a VPC? Because
 I would be very surprised if the default VPC must be used.

 On Sat, Aug 15, 2015 at 2:50 AM, Prem Yadav ipremya...@gmail.com wrote:


 The EC2 nodes must be in the default VPC.

 create a ring in the VPC in region B. Use VPC peering to connect the
 default and the region B VPC.
 The two rings should join the existing one. Alter the replication
 strategy to network replication so that the data is replicated to the new
 rings. Repair the keyspaces.
 Once it is done, you can decommission the existing ring.

 For Spark, if you are using the DataStax version, it comes with Spark. You
 just need to change a config and Spark starts along with Cassandra. A
 separate ring is advised for analytics stuff.


 On Sat, Aug 15, 2015 at 1:10 AM, Aiman Parvaiz ai...@flipagram.com
 wrote:

 Hi all
 We are planning to move C* from EC2 (region A) to VPC in region B. I
 will enumerate our goals so that you guys can advice me keeping in mind the
 bigger picture.

 Goals:
 - Move to a VPC in another region.
 - Enable Vnodes.
 - Bump up RF to 3.
 - Ability to have a spark cluster.

 I know this is a LOT of work and I know this all might not be possible
 in one go.

 Existing cluster in EC2 is using RF=2, simple snitch and simple
 replication.

 I am not sure what would be the best way to approach this task. So
 please anyone who has done this and would like to share anything I would
 really appreciate the effort.

 Thanks







-- 
*Aiman Parvaiz*
Lead Systems Architect
ai...@flipagram.com
cell: 213-300-6377
http://flipagram.com/apz


Need advice for multi DC C* setup

2015-08-14 Thread Aiman Parvaiz
Hi all
We are planning to move C* from EC2 (region A) to VPC in region B. I will
enumerate our goals so that you guys can advice me keeping in mind the
bigger picture.

Goals:
- Move to a VPC in another region.
- Enable Vnodes.
- Bump up RF to 3.
- Ability to have a spark cluster.

I know this is a LOT of work and I know this all might not be possible in
one go.

Existing cluster in EC2 is using RF=2, simple snitch and simple replication.

I am not sure what would be the best way to approach this task. So please
anyone who has done this and would like to share anything I would really
appreciate the effort.

Thanks


Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Aiman Parvaiz
Hi Bryan
How's GC behaving on these boxes?

On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com wrote:

 Hi there,

 Within our Cassandra cluster, we're observing, on occasion, one or two
 nodes at a time becoming partially unresponsive.

 We're running 2.1.7 across the entire cluster.

 nodetool still reports the node as being healthy, and it does respond to
 some local queries; however, the CPU is pegged at 100%. One common thread
 (heh) each time this happens is that there always seems to be one or more
 compaction threads running (via nodetool tpstats), and some appear to be
 stuck (active count doesn't change, pending count doesn't decrease). A
 request for compactionstats hangs with no response.

 Each time we've seen this, the only thing that appears to resolve the
 issue is a restart of the Cassandra process; the restart does not appear to
 be clean, and requires one or more attempts (or a -9 on occasion).

 There does not seem to be any pattern to what machines are affected; the
 nodes thus far have been different instances on different physical machines
 and on different racks.

 Has anyone seen this before? Alternatively, when this happens again, what
 data can we collect that would help with the debugging process (in addition
 to tpstats)?

 Thanks in advance,

 Bryan




-- 
*Aiman Parvaiz*
Lead Systems Architect
ai...@flipagram.com
cell: 213-300-6377
http://flipagram.com/apz


Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Aiman Parvaiz
I faced something similar in the past, and the reason for nodes becoming 
unresponsive intermittently was long GC pauses. That's why I wanted to bring 
this to your attention in case a GC pause is a potential cause.
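
If it helps, one quick way to check is the GCInspector lines Cassandra itself 
logs for long pauses (path is illustrative):

    grep GCInspector /var/log/cassandra/system.log | tail -n 20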

Sent from my iPhone

 On Jul 22, 2015, at 4:32 PM, Bryan Cheng br...@blockcypher.com wrote:
 
 Aiman,
 
 Your post made me look back at our data a bit. The most recent occurrence of 
 this incident was not preceded by any abnormal GC activity; however, the 
 previous occurrence (which took place a few days ago) did correspond to a 
 massive, order-of-magnitude increase in both ParNew and CMS collection times 
 which lasted ~17 hours.
 
 Was there something in particular that links GC to these stalls? At this 
 point in time, we cannot identify any particular reason for either that GC 
 spike or the subsequent apparent compaction stall, although it did not seem 
 to have any effect on our usage of the cluster.
 
 On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng br...@blockcypher.com wrote:
 Hi Aiman,
 
 We previously had issues with GC, but since upgrading to 2.1.7 things seem a 
 lot healthier.
 
 We collect GC statistics through collectd via the garbage collector mbean, 
 ParNew GC's report sub 500ms collection time on average (I believe 
 accumulated per minute?) and CMS peaks at about 300ms collection time when 
 it runs.
 
 On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz ai...@flipagram.com wrote:
 Hi Bryan
 How's GC behaving on these boxes?
 
 On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com wrote:
 Hi there,
 
 Within our Cassandra cluster, we're observing, on occasion, one or two 
 nodes at a time becoming partially unresponsive.
 
 We're running 2.1.7 across the entire cluster.
 
 nodetool still reports the node as being healthy, and it does respond to 
 some local queries; however, the CPU is pegged at 100%. One common thread 
 (heh) each time this happens is that there always seems to be one of more 
 compaction threads running (via nodetool tpstats), and some appear to be 
 stuck (active count doesn't change, pending count doesn't decrease). A 
 request for compactionstats hangs with no response.
 
 Each time we've seen this, the only thing that appears to resolve the 
 issue is a restart of the Cassandra process; the restart does not appear 
 to be clean, and requires one or more attempts (or a -9 on occasion).
 
 There does not seem to be any pattern to what machines are affected; the 
 nodes thus far have been different instances on different physical 
 machines and on different racks.
 
 Has anyone seen this before? Alternatively, when this happens again, what 
 data can we collect that would help with the debugging process (in 
 addition to tpstats)?
 
 Thanks in advance,
 
 Bryan
 
 
 
 -- 
 Aiman Parvaiz
 Lead Systems Architect
 ai...@flipagram.com
 cell: 213-300-6377
 http://flipagram.com/apz
 


Re: Decommissioned node still in Gossip

2015-06-30 Thread Aiman Parvaiz
I was having exactly the same issue with the same version. Check your seed list 
and make sure it contains only live nodes. I know that seeds are only read when 
Cassandra starts, but updating the seed list to live nodes and then doing a 
rolling restart fixed this issue for me.
I hope this helps you.
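
Something along these lines (config path and service manager are assumptions):

    grep -A3 'seed_provider' /etc/cassandra/conf/cassandra.yaml   # confirm only live nodes are listed as seeds
    # after fixing the list, restart one node at a time
    nodetool drain && sudo service cassandra restart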

Thanks

Sent from my iPhone

 On Jun 30, 2015, at 4:42 AM, Jeff Williams je...@wherethebitsroam.com wrote:
 
 Hi,
 
 I have a cluster which had 4 datacenters running 2.0.12. Last week one of the 
 datacenters was decommissioned using nodetool decommission on each of the 
 servers in turn. This seemed to work fine until one of the nodes started 
 appearing in the logs of all of the remaining servers with messages like:
 
  INFO [GossipStage:3] 2015-06-30 11:22:39,189 Gossiper.java (line 924) 
 InetAddress /172.29.8.8 is now DOWN
  INFO [GossipStage:3] 2015-06-30 11:22:39,190 StorageService.java (line 1773) 
 Removing tokens [...] for /172.29.8.8
 
 These come up in the log every minute or two. I believe it may have 
 re-appeared after a repair, but I'm not sure.
 
 The problem is that this node does not exist in nodetool status, nodetool 
 gossipinfo or in the system.peers table. So how can I tell the cluster that 
 this node is decommissioned?
 
 Regards,
 Jeff


Re: C* 2.0.15 - java.lang.NegativeArraySizeException

2015-06-09 Thread Aiman Parvaiz
Quick update: I saw the same error on another new node; again, the node isn't
really misbehaving up till now.

Thanks

On Mon, Jun 8, 2015 at 9:48 PM, Aiman Parvaiz ai...@flipagram.com wrote:

 Hi everyone
 I am running C* 2.0.9 and decided to do a rolling upgrade. Added a node of
 C* 2.0.15 in the existing cluster and saw this twice:

 Jun  9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,658
 INFO CompactionExecutor:4 CompactionTask.runMayThrow - Compacting
 [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-37-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-40-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-42-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-38-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-39-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-44-Data.db')]



 Jun  9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,669
 ERROR CompactionExecutor:4 CassandraDaemon.uncaughtException - Exception in
 thread Thread[CompactionExecutor:4,1,main]
 Jun  9 02:27:20 prod-cass23.localdomain
 *java.lang.NegativeArraySizeException*
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:335)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:462)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:448)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:432)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableReader.getAncestors(SSTableReader.java:1366)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata.createCollector(SSTableMetadata.java:134)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionTask.createCompactionWriter(CompactionTask.java:316)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:162)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.FutureTask.run(FutureTask.java:262)
 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 Jun  9 02:27:20 prod-cass23.localdomain at
 java.lang.Thread.run(Thread.java:745)
 Jun  9 02:27:47 prod-cass23.localdomain cassandra: 2015-06-09 02:27:47,725
 INFO main StorageService.setMode - JOINING: Starting to bootstrap...

 As you can see, this happened the first time even before joining. Second
 occurrence stack trace:

 Jun  9 02:32:15 prod-cass23.localdomain cassandra: 2015-06-09 02:32:15,097
 ERROR CompactionExecutor:6 CassandraDaemon.uncaughtException - Exception in
 thread Thread[CompactionExecutor:6,1,main]
 Jun  9 02:32:15 prod-cass23.localdomain
 java.lang.NegativeArraySizeException
 Jun  9 02:32:15 prod-cass23.localdomain at
 org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:335)
 Jun  9 02:32:15 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:462)
 Jun  9 02:32:15 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:448)
 Jun  9 02:32:15 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:432)
 Jun  9 02:32:15 prod-cass23.localdomain

Re: C* 2.0.15 - java.lang.NegativeArraySizeException

2015-06-09 Thread Aiman Parvaiz
Thanks Sean. In this scenario I would also end up running 2 versions of
Cassandra, as I am planning to do a rolling upgrade and hence zero downtime.
Upgrading in place one node at a time would lead to running 2 versions;
please let me know if I am missing something here.

On Tue, Jun 9, 2015 at 2:00 PM, sean_r_dur...@homedepot.com wrote:

  In my experience, you don’t want to do streaming operations (repairs or
 bootstraps) with mixed Cassandra versions. Upgrade the ring to the new
 version, and then add nodes (or add the nodes at the current version, and
 then upgrade).





 Sean Durity



 *From:* Aiman Parvaiz [mailto:ai...@flipagram.com]
 *Sent:* Tuesday, June 09, 2015 1:29 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: C* 2.0.15 - java.lang.NegativeArraySizeException



 Quick update: I saw the same error on another new node; again, the node isn't
 really misbehaving up till now.



 Thanks



 On Mon, Jun 8, 2015 at 9:48 PM, Aiman Parvaiz ai...@flipagram.com wrote:

 Hi everyone

 I am running C* 2.0.9 and decided to do a rolling upgrade. Added a node of
 C* 2.0.15 in the existing cluster and saw this twice:



 Jun  9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,658
 INFO CompactionExecutor:4 CompactionTask.runMayThrow - Compacting
 [SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-37-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-40-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-42-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-38-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-39-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-44-Data.db')]







 Jun  9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,669
 ERROR CompactionExecutor:4 CassandraDaemon.uncaughtException - Exception in
 thread Thread[CompactionExecutor:4,1,main]

 Jun  9 02:27:20 prod-cass23.localdomain
 *java.lang.NegativeArraySizeException*

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:335)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:462)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:448)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:432)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableReader.getAncestors(SSTableReader.java:1366)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.io.sstable.SSTableMetadata.createCollector(SSTableMetadata.java:134)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionTask.createCompactionWriter(CompactionTask.java:316)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:162)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)

 Jun  9 02:27:20 prod-cass23.localdomain at
 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)

 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.FutureTask.run(FutureTask.java:262)

 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 Jun  9 02:27:20 prod-cass23.localdomain at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 Jun  9 02:27:20 prod-cass23.localdomain at
 java.lang.Thread.run(Thread.java:745)

 Jun  9 02:27:47 prod-cass23.localdomain cassandra: 2015-06-09 02:27:47,725
 INFO main StorageService.setMode - JOINING: Starting to bootstrap...



 As you can see, this happened the first time even before joining. Second
 occurrence stack trace:



 Jun  9 02:32:15 prod-cass23.localdomain cassandra: 2015-06-09 02:32:15,097
 ERROR CompactionExecutor:6 CassandraDaemon.uncaughtException - Exception in
 thread Thread[CompactionExecutor:6,1

Re: auto clear data with ttl

2015-06-08 Thread Aiman Parvaiz
With gc_grace set to zero, tombstones will be removed without any delay once 
compaction runs, so it's possible that the tombstone-containing SSTables still 
need to be compacted. Either wait for compaction to happen or do a manual 
compaction, depending on your compaction strategy. Manual compaction does have 
some drawbacks, so please read about it.
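
A minimal sketch of the manual route, assuming STCS and placeholder 
keyspace/table names:

    cqlsh -e "ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 0;"
    nodetool compact my_ks my_table    # major compaction; note it leaves one big SSTable under STCS
    cqlsh -e "ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 864000;"   # restore a safer value afterwards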

Sent from my iPhone

 On Jun 8, 2015, at 7:26 PM, 曹志富 cao.zh...@gmail.com wrote:
 
 I have C* 2.1.5 and store some data with TTL. I reduced gc_grace_seconds to zero.
 
 But it seems to have no effect.
 
 Did I miss something?
 --
 Ranger Tsao


C* 2.0.15 - java.lang.NegativeArraySizeException

2015-06-08 Thread Aiman Parvaiz
Hi everyone
I am running C* 2.0.9 and decided to do a rolling upgrade. I added a node of
C* 2.0.15 to the existing cluster and saw this twice:

Jun  9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,658
INFO CompactionExecutor:4 CompactionTask.runMayThrow - Compacting
[SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-37-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-40-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-42-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-38-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-39-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/system/schema_columns/system-schema_columns-jb-44-Data.db')]



Jun  9 02:27:20 prod-cass23.localdomain cassandra: 2015-06-09 02:27:20,669
ERROR CompactionExecutor:4 CassandraDaemon.uncaughtException - Exception in
thread Thread[CompactionExecutor:4,1,main]
Jun  9 02:27:20 prod-cass23.localdomain
*java.lang.NegativeArraySizeException*
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:335)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:462)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:448)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:432)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableReader.getAncestors(SSTableReader.java:1366)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata.createCollector(SSTableMetadata.java:134)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.db.compaction.CompactionTask.createCompactionWriter(CompactionTask.java:316)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:162)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
Jun  9 02:27:20 prod-cass23.localdomain at
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
Jun  9 02:27:20 prod-cass23.localdomain at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
Jun  9 02:27:20 prod-cass23.localdomain at
java.util.concurrent.FutureTask.run(FutureTask.java:262)
Jun  9 02:27:20 prod-cass23.localdomain at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Jun  9 02:27:20 prod-cass23.localdomain at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Jun  9 02:27:20 prod-cass23.localdomain at
java.lang.Thread.run(Thread.java:745)
Jun  9 02:27:47 prod-cass23.localdomain cassandra: 2015-06-09 02:27:47,725
INFO main StorageService.setMode - JOINING: Starting to bootstrap...

As you can see, this happened the first time even before joining. Second
occurrence stack trace:

Jun  9 02:32:15 prod-cass23.localdomain cassandra: 2015-06-09 02:32:15,097
ERROR CompactionExecutor:6 CassandraDaemon.uncaughtException - Exception in
thread Thread[CompactionExecutor:6,1,main]
Jun  9 02:32:15 prod-cass23.localdomain java.lang.NegativeArraySizeException
Jun  9 02:32:15 prod-cass23.localdomain at
org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:335)
Jun  9 02:32:15 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:462)
Jun  9 02:32:15 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:448)
Jun  9 02:32:15 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:432)
Jun  9 02:32:15 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableReader.getAncestors(SSTableReader.java:1366)
Jun  9 02:32:15 prod-cass23.localdomain at
org.apache.cassandra.io.sstable.SSTableMetadata.createCollector(SSTableMetadata.java:134)
Jun  9 02:32:15 prod-cass23.localdomain at

Reading too many tombstones

2015-06-04 Thread Aiman Parvaiz
Hi everyone,
We are running a 10-node Cassandra 2.0.9 cluster without vnodes. We are
running into an issue where we are reading too many tombstones and hence
getting tons of WARN messages and some ERROR query-aborted messages.

cass-prod4 2015-06-04 14:38:34,307 WARN ReadStage:1998
SliceQueryFilter.collectReducedColumns - Read 46 live and 1560 tombstoned
cells in ABC.home_feed (see tombstone_warn_threshold). 100 columns was
requested, slices=[-], delInfo={deletedAt=-9223372036854775808,
localDeletion=2147483647}

cass-prod2 2015-05-31 12:55:55,331 ERROR ReadStage:1953
SliceQueryFilter.collectReducedColumns - Scanned over 10 tombstones in
ABC.home_feed; query aborted (see tombstone_fail_threshold)

As you can see, all of this is happening for CF home_feed. This CF is
basically maintaining a feed with TTL set to 2592000 (30 days).
gc_grace_seconds for this CF is 864000 and it uses SizeTieredCompactionStrategy.

Repairs have been running regularly and automatic compactions are occurring
normally too.

I can definitely use some help here in how to tackle this issue.

Up till now I have the following ideas:

1) I can set gc_grace_seconds to 0 and then do a manual compaction for
this CF and bump gc_grace up again.

2) Set gc_grace to 0, run a manual compaction on this CF and leave gc_grace at
zero. In this case I have to be careful in running repairs.

3) I am also considering moving to DateTieredCompactionStrategy.

What would be the best approach here for my feed case? Any help is
appreciated.

Thanks


Re: Reading too many tombstones

2015-06-04 Thread Aiman Parvaiz
Yeah, we don't update old data. One thing I am curious about is why we are
running into so many tombstones with compaction happening normally. Is
compaction not removing tombstones?
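
One way to check, with illustrative paths (sstablemetadata ships with the 
Cassandra tools):

    for f in /var/lib/cassandra/data/ABC/home_feed/*-Data.db; do
      echo "$f"
      sstablemetadata "$f" | grep -i 'droppable tombstones'
    done

A high estimated droppable tombstone ratio on old SSTables would suggest they 
simply aren't being picked up for compaction anymore.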

On Thu, Jun 4, 2015 at 11:25 AM, Jonathan Haddad j...@jonhaddad.com wrote:

 DateTiered is fantastic if you've got time series, TTLed data.  That means
 no updates to old data.

 On Thu, Jun 4, 2015 at 10:58 AM Aiman Parvaiz ai...@flipagram.com wrote:

 Hi everyone,
 We are running a 10-node Cassandra 2.0.9 cluster without vnodes. We are
 running into an issue where we are reading too many tombstones and hence
 getting tons of WARN messages and some ERROR query-aborted messages.

 cass-prod4 2015-06-04 14:38:34,307 WARN ReadStage:1998
 SliceQueryFilter.collectReducedColumns - Read 46 live and 1560 tombstoned
 cells in ABC.home_feed (see tombstone_warn_threshold). 100 columns was
 requested, slices=[-], delInfo={deletedAt=-9223372036854775808,
 localDeletion=2147483647}

 cass-prod2 2015-05-31 12:55:55,331 ERROR ReadStage:1953
 SliceQueryFilter.collectReducedColumns - Scanned over 10 tombstones in
 ABC.home_feed; query aborted (see tombstone_fail_threshold)

 As you can see all of this is happening for CF home_feed. This CF is
 basically maintaining a feed with TTL set to 2592000 (30 days).
 gc_grace_seconds for this CF is 864000 and its SizeTieredCompaction.

 Repairs have been running regularly and automatic compactions are
 occurring normally too.

 I can definitely use some help here in how to tackle this issue.

 Up till now I have the following ideas:

 1) I can make gc_grace_seconds to 0 and then do a manual compaction for
 this CF and bump up the gc_grace again.

 2) Make gc_grace 0, run manual compaction on this CF and leave gc_grace
 to zero. In this case have to be careful in running repairs.

 3) I am also considering moving to DateTier Compaction.

 What would be the best approach here for my feed case. Any help is
 appreciated.

 Thanks




Re: Reading too many tombstones

2015-06-04 Thread Aiman Parvaiz
Thanks Carlos for pointing me in that direction; I have some interesting
findings to share. In December last year there was a redesign of
home_feed and it was migrated to a new CF. Initially all the data in
home_feed had a TTL of 1 year, but migrated data was inserted with a TTL of
30 days.
Now, on digging a bit deeper, I found that home_feed still has data from Jan
2015 with a TTL of 1275094 seconds (about 14 days).

This data is for the same id from home_feed:
 date                     | ttl(description)
--------------------------+------------------
 2015-04-03 21:22:58+0000 |   759791
 2015-04-03 04:50:11+0000 |   412706
 2015-03-30 22:18:58+0000 |   759791
 2015-03-29 15:20:36+0000 |  1978689
 2015-03-28 14:41:28+0000 |  1275116
 2015-03-28 14:31:25+0000 |  1275116
 2015-03-18 19:23:44+0000 |  2512936
 2015-03-13 17:51:01+0000 |  1978689
 2015-02-12 15:41:01+0000 |  1978689
 2015-01-18 02:36:27+0000 |  1275094


I am not sure what happened in that migration, but I think that when trying
to load data we are reading this old data (as the feed queries 1000/page to be
displayed to the user), and in order to read this data we have to
cross (read) lots of tombstones (newer data has TTL working correctly), and
hence the error.
I am not sure how much DateTiered would help us in this situation either. If
anyone has any suggestions on how to handle this, either at the systems or
developer level, please pitch in.

Thanks

On Thu, Jun 4, 2015 at 11:47 AM, Carlos Rolo r...@pythian.com wrote:

 The TTLed data will only be removed after gc_grace_seconds, so your data
 with a 30-day TTL will still be in Cassandra for 10 more days (40 in total).
 Has your data been there for longer than that? Otherwise it is expected
 behaviour, and you should probably do something in your data model to avoid
 scanning tombstoned data.

 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
 http://linkedin.com/in/carlosjuzarterolo*
 Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
 www.pythian.com

 On Thu, Jun 4, 2015 at 8:31 PM, Aiman Parvaiz ai...@flipagram.com wrote:

 yeah we don't update old data. One thing I am curious about is why are we
 running in to so many tombstones with compaction happening normally. Is
 compaction not removing tombstomes?


 On Thu, Jun 4, 2015 at 11:25 AM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 DateTiered is fantastic if you've got time series, TTLed data.  That
 means no updates to old data.

 On Thu, Jun 4, 2015 at 10:58 AM Aiman Parvaiz ai...@flipagram.com
 wrote:

 Hi everyone,
 We are running a 10 node Cassandra 2.0.9 without vnode cluster. We are
 running in to a issue where we are reading too many tombstones and hence
 getting tons of WARN messages and some ERROR query aborted.

 cass-prod4 2015-06-04 14:38:34,307 WARN ReadStage:1998
 SliceQueryFilter.collectReducedColumns - Read 46 live and 1560 tombstoned
 cells in ABC.home_feed (see tombstone_warn_threshold). 100 columns was
 requested, slices=[-], delInfo={deletedAt=-9223372036854775808,
 localDeletion=2147483647}

 cass-prod2 2015-05-31 12:55:55,331 ERROR ReadStage:1953
 SliceQueryFilter.collectReducedColumns - Scanned over 10 tombstones in
 ABC.home_feed; query aborted (see tombstone_fail_threshold)

 As you can see all of this is happening for CF home_feed. This CF is
 basically maintaining a feed with TTL set to 2592000 (30 days).
 gc_grace_seconds for this CF is 864000 and it uses SizeTieredCompaction.

 Repairs have been running regularly and automatic compactions are
 occurring normally too.

 I can definitely use some help here in how to tackle this issue.

 Up till now I have the following ideas:

 1) I can set gc_grace_seconds to 0, do a manual compaction for
 this CF and then bump gc_grace back up again.

 2) Set gc_grace to 0, run a manual compaction on this CF and leave gc_grace
 at zero. In this case I have to be careful when running repairs.

 3) I am also considering moving to DateTiered compaction.

 What would be the best approach here for my feed case. Any help is
 appreciated.

 Thanks







 --






-- 
Lead Systems Architect
10351 Santa Monica Blvd, Suite 3310
Los Angeles CA 90025


ERROR Compaction Interrupted

2015-06-01 Thread Aiman Parvaiz
Hi everyone,
I am running C* 2.0.9 without vnodes and RF=2. Recently, while repairing and 
rebalancing the cluster, I encountered one instance of this (just once, on one 
node):

ERROR CompactionExecutor:55472 
CassandraDaemon.uncaughtException - Exception in thread 
Thread[CompactionExecutor:55472,1,main]

May 30 19:31:09 cass-prod4.localdomain cassandra: 2015-05-30 19:31:09,991 ERROR 
CompactionExecutor:55472 CassandraDaemon.uncaughtException - Exception in 
thread Thread[CompactionExecutor:55472,1,main]

May 30 19:31:09 cass-prod4.localdomain 
org.apache.cassandra.db.compaction.CompactionInterruptedException: Compaction 
interrupted: Compaction@1b0b43e5-bef5-34f9-af08-405a7b58c71f(flipagram, 
home_feed_entry_index, 218409618/450008574)bytes

May 30 19:31:09 cass-prod4.localdomain at 
org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:157)
May 30 19:31:09 cass-prod4.localdomain at 
org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
May 30 19:31:09 cass-prod4.localdomain at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
May 30 19:31:09 cass-prod4.localdomain at 
org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
May 30 19:31:09 cass-prod4.localdomain at 
org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
May 30 19:31:09 cass-prod4.localdomain at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
May 30 19:31:09 cass-prod4.localdomain at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
May 30 19:31:09 cass-prod4.localdomain at 
java.util.concurrent.FutureTask.run(FutureTask.java:262)
May 30 19:31:09 cass-prod4.localdomain at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
May 30 19:31:09 cass-prod4.localdomain at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
May 30 19:31:09 cass-prod4.localdomain at 
java.lang.Thread.run(Thread.java:745)

After looking around a bit in the mailing list archives etc. I understand that this 
might mean data corruption, and I plan to take the node offline and replace it 
with a new one, but I still wanted to see if anyone can throw some light here in 
case I am missing something.
Also, if this is a case of a corrupted SSTable, should I be concerned about it 
getting replicated, and should I take care of it on the replicas too?
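For what it is worth, one thing I may try first, before replacing the node
outright, is a scrub of just that CF (a sketch, using the keyspace/CF names
from the error above):

# rewrite the SSTables of the suspect CF, discarding rows it cannot read
nodetool scrub flipagram home_feed_entry_index
# then make sure the replicas agree again afterwards
nodetool repair -pr flipagram home_feed_entry_index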

Thanks

Re: stalled nodetool repair?

2014-08-21 Thread Aiman Parvaiz
If nodetool compactionstats says there are no Validation compactions
running (and the compaction queue is empty) and netstats says there is
nothing streaming, there is a good chance the repair is finished or dead.

Source:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-it-safe-to-stop-a-read-repair-and-any-suggestion-on-speeding-up-repairs-td6607367.html

You might find this helpful.
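A crude way to keep an eye on it from a shell is just to poll both commands (a
sketch, adjust the interval to taste):

while true; do
    date
    nodetool compactionstats | grep -i validation || echo "no validation compactions"
    nodetool netstats | head -20
    sleep 300
done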

Thanks


On Thu, Aug 21, 2014 at 12:32 PM, Kevin Burton bur...@spinn3r.com wrote:

 How do I watch the progress of nodetool repair.

 Looks like the folklore from the list says to just use

 nodetool compactionstats
 nodetool netstats

 … but the repair seems locked/stalled and neither of these are showing any
 progress..

 granted , this is a lot of data, but it would be nice to at least see some
 progress.

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




Re: EC2 SSD cluster costs

2014-08-19 Thread Aiman Parvaiz
I completely agree with others here. It depends on your use case. We were
using hi1.4xlarge boxes and paying a huge amount to Amazon. Lately our
requirements changed: we are not hammering C* as much and our data size
has gone down too, so given the new conditions we reserved and migrated to
c3.4xlarge instances to save quite a lot of money.


On Aug 19, 2014, at 10:25 AM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

Still using good ol' m1.xlarge here + external caching (memcached). Trying
to adapt our use case to have different clusters for different use cases so
we can leverage SSD at an acceptable cost in some of them.


On Tue, Aug 19, 2014 at 1:05 PM, Shane Hansen shanemhan...@gmail.com
wrote:

 Again, depends on your use case.
 But we wanted to keep the data per node below 500gb,
 and we found raided ssds to be the best bang for the buck
 for our cluster. I think we moved to from the i2 to c3 because
 our bottleneck tended to be CPU utilization (from parsing requests).



 (Disclaimer, we're not cassandra veterans but we're not part of the RF=N=3
 club)



 On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com
 wrote:

 Short answer, it depends on your use-case.

 We migrated to i2.xlarge nodes and saw an immediate increase in
 performance.  If you just need plain ole raw disk space and don’t have a
 performance requirement to meet then the m1 machines would work, or hell
 even SSD EBS volumes may work for you.  The problem we were having is that
 we couldn’t fill the m1 machines because we needed to add more nodes for
 performance.  Now we have much more power and just the right amount of disk
 space.

 Basically saying, these are not apples-to-apples comparisons



 On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com)
 wrote:

 The latest consensus around the web for running Cassandra on EC2 seems to
 be to use the new SSD instances. I've not seen any mention of the elephant in
 the room - using the new SSD instances significantly raises the cluster
 cost per TB. With Cassandra's strength being linear scalability to many
 terabytes of data, it strikes me as odd that everyone is recommending such
 a large storage cost hike almost without reservation.

 Monthly cost comparison for a 100TB cluster (non-reserved instances):

 m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
 m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
 i2.xlarge (1x800 SSD): $76,000 (125 nodes)

 Best case, the cost goes up 150%. How are others approaching these new
 instances? Have you migrated and eaten the costs, or are you staying on
 previous generation until prices come down?





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: how do i know if nodetool repair is finished

2014-08-01 Thread Aiman Parvaiz
This is an old post, I am not sure if something changed for newer C* versions.

If nodetool compactionstats says there are no Validation compactions
running (and the compaction queue is empty) and netstats says there is
nothing streaming, there is a good chance the repair is finished or dead.
If a neighbour dies during a repair the node it was started on will wait
for 48 hours(?) until it times out. Check the logs on the machines for
errors, particularly from the AntiEntropyService. And see what
compactionstats is saying on all the nodes involved in the repair.

source:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-it-safe-to-stop-a-read-repair-and-any-suggestion-on-speeding-up-repairs-td6607367.html
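To check the logs, something along these lines is usually enough (the log path
and the exact message wording vary by version and package, so treat the
patterns as a starting point):

grep -iE 'antientropy|repair' /var/log/cassandra/system.log | tail -50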


On Aug 1, 2014, at 2:46 AM, KZ Win kz...@pelotoncycle.com wrote:

I have a 2 node Apache Cassandra (2.0.3) cluster with a rep factor of 1. I
changed the rep factor to 2 using the following command in cqlsh:

ALTER KEYSPACE mykeyspace WITH REPLICATION =   { 'class' :
'SimpleStrategy', 'replication_factor' : 2 };

I then tried to run recommended nodetool repair after doing this type of
alter.

The problem is that this command sometimes finishes very quickly. When it
finishes like that it will normally say 'Lost notification...' and the exit
code is not zero.

So I just repeat 'nodetool repair' until it finishes without error. I
also check that 'nodetool status' reports the expected disk space for each
node (with rep factor 1, each node has about 7GB, and I expect each to be
around 14GB after nodetool repair, assuming no cluster usage in the
meantime).

Is there a more correct way to determine that 'nodetool repair' is finished
in this case?


Re: VPC AWS

2014-06-05 Thread Aiman Parvaiz
Thanks for this info Michael. As far as restoring nodes in the public VPC is
concerned, I was thinking (and I might be wrong here) that we can have a ring
spread across EC2 and the public subnet of a VPC. This way I can simply
decommission nodes in EC2 as I gradually introduce new nodes in the public
subnet of the VPC; I will end up with a ring in the public subnet and can then
migrate the nodes from public to private in a similar way.

If anyone has any experience/ suggestions with this please share, would
really appreciate it.

Aiman


On Thu, Jun 5, 2014 at 10:37 AM, Michael Theroux mthero...@yahoo.com
wrote:

 The implementation of moving from EC2 to a VPC was a bit of a juggling
 act.  Our motivation was two fold:

 1) We were running out of static IP addresses, and it was becoming
 increasingly difficult in EC2 to design around limiting the number of
 static IP addresses to the number of public IP addresses EC2 allowed
 2) VPC affords us an additional level of security that was desirable.

 However, we needed to consider the following limitations:

 1) By default, you have a limited number of available public IPs for both
 EC2 and VPC.
 2) AWS security groups need to be configured to allow traffic for
 Cassandra to/from instances in EC2 and the VPC.

 You are correct at the high level that the migration goes from EC2 -> Public
 VPC (VPC with an Internet Gateway) -> Private VPC (VPC with a NAT).  The
 first phase was moving instances to the public VPC, setting broadcast and
 seeds to the public IPs we had available.  Basically:

 1) Take down a node, taking a snapshot for a backup
 2) Restore the node on the public VPC, assigning it to the correct
 security group, manually setting the seeds to other available nodes
 3) Verify the cluster can communicate
 4) Repeat
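 The per-node edits in step 2 come down to a handful of cassandra.yaml
 settings; roughly something like this on each box (the IPs are placeholders
 and the yaml path depends on the package):

 sed -i \
   -e 's/^listen_address:.*/listen_address: <node-private-ip>/' \
   -e 's/^broadcast_address:.*/broadcast_address: <node-public-ip>/' \
   -e 's/- seeds:.*/- seeds: "<seed-public-ip-1>,<seed-public-ip-2>"/' \
   /etc/cassandra/cassandra.yaml

 (broadcast_address is commented out in the stock yaml, so on a fresh file it
 has to be added rather than edited.)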

 Realize the NAT instance on the private subnet will also require a public
 IP.  What got really interesting is that near the end of the process we
 ran out of available IPs, requiring us to switch the final node that was on
 EC2 directly to the private VPC (and taking down two nodes at once, which
 our setup allowed given we had 6 nodes with an RF of 3).

 What we did, and highly suggest for the switch, is to write down every
 step that has to happen on every node during the switch.  In our case, many
 of the moved nodes required slightly different configurations for items
 like the seeds.

 It's been a couple of years, so my memory on this may be a little fuzzy :)

 -Mike

   --
  *From:* Aiman Parvaiz ai...@shift.com
 *To:* user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com
 *Sent:* Thursday, June 5, 2014 12:55 PM
 *Subject:* Re: VPC AWS

 Michael,
 Thanks for the response, I am about to head into something very similar,
 if not exactly the same. I envision things happening along the same lines as
 you mentioned.
 I would be grateful if you could please throw some more light on how you
 went about switching Cassandra nodes from the public subnet to private without
 any downtime.
 I have not started on this project yet, I am still in my research phase. I plan
 to have an EC2 + public VPC cluster and then decommission the EC2 nodes to have
 everything in the public subnet; the next step would be to move it to the
 private subnet.

 Thanks


 On Thu, Jun 5, 2014 at 8:14 AM, Michael Theroux mthero...@yahoo.com
 wrote:

 We personally use the EC2Snitch, however, we don't have the multi-region
 requirements you do,

 -Mike

   --
  *From:* Alain RODRIGUEZ arodr...@gmail.com
 *To:* user@cassandra.apache.org
 *Sent:* Thursday, June 5, 2014 9:14 AM
 *Subject:* Re: VPC AWS

 I think you can define VPC subnet to be public (to have public + private
 IPs) or private only.

 Any insight regarding snitches ? What snitch do you guys use ?


 2014-06-05 15:06 GMT+02:00 William Oberman ober...@civicscience.com:

 I don't think traffic will flow between classic ec2 and vpc directly.
 There is some kind of gateway bridge instance that sits between, acting as
 a NAT.   I would think that would cause new challenges for:
 -transitions
 -clients

 Sorry this response isn't heavy on content!  I'm curious how this thread
 goes...

 Will

 On Thursday, June 5, 2014, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi guys,

 We are going to move from a cluster made of simple Amazon EC2 servers to a
 VPC cluster. We are using Cassandra 1.2.11 and I have some questions
 regarding this switch and the Cassandra configuration inside a VPC.

 Actually I found no documentation on this topic, but I am quite sure that
 some people are already using VPC. If you can point me to any documentation
 regarding VPC / Cassandra, it would be very nice of you. We have only one
 DC for now, but we need to remain multi DC compatible, since we will add DC
 very soon.

 Else, I would like to know if I should keep using EC2MultiRegionSnitch or
 change the snitch to anything else.

 What about broadcast/listen ip, seeds...?

 We currently use public IPs for the broadcast address and for seeds. We
Re: VPC AWS

2014-06-05 Thread Aiman Parvaiz
Cool, thanks again for this.


On Thu, Jun 5, 2014 at 11:51 AM, Michael Theroux mthero...@yahoo.com
wrote:

 You can have a ring spread across EC2 and the public subnet of a VPC.
  That is how we did our migration.  In our case, we simply replaced the
 existing EC2 node with a new instance in the public VPC, restored from a
 backup taken right before the switch.

 -Mike

   --
  *From:* Aiman Parvaiz ai...@shift.com
 *To:* Michael Theroux mthero...@yahoo.com
 *Cc:* user@cassandra.apache.org user@cassandra.apache.org
 *Sent:* Thursday, June 5, 2014 2:39 PM
 *Subject:* Re: VPC AWS

 Thanks for this info Michael. As far as restoring nodes in the public VPC is
 concerned, I was thinking (and I might be wrong here) that we can have a ring
 spread across EC2 and the public subnet of a VPC. This way I can simply
 decommission nodes in EC2 as I gradually introduce new nodes in the public
 subnet of the VPC; I will end up with a ring in the public subnet and can then
 migrate the nodes from public to private in a similar way.

 If anyone has any experience/ suggestions with this please share, would
 really appreciate it.

 Aiman


 On Thu, Jun 5, 2014 at 10:37 AM, Michael Theroux mthero...@yahoo.com
 wrote:

 The implementation of moving from EC2 to a VPC was a bit of a juggling
 act.  Our motivation was two fold:

 1) We were running out of static IP addresses, and it was becoming
 increasingly difficult in EC2 to design around limiting the number of
 static IP addresses to the number of public IP addresses EC2 allowed
 2) VPC affords us an additional level of security that was desirable.

 However, we needed to consider the following limitations:

 1) By default, you have a limited number of available public IPs for both
 EC2 and VPC.
 2) AWS security groups need to be configured to allow traffic for
 Cassandra to/from instances in EC2 and the VPC.

 You are correct at the high level that the migration goes from EC2 -> Public
 VPC (VPC with an Internet Gateway) -> Private VPC (VPC with a NAT).  The
 first phase was moving instances to the public VPC, setting broadcast and
 seeds to the public IPs we had available.  Basically:

 1) Take down a node, taking a snapshot for a backup
 2) Restore the node on the public VPC, assigning it to the correct
 security group, manually setting the seeds to other available nodes
 3) Verify the cluster can communicate
 4) Repeat

 Realize the NAT instance on the private subnet will also require a public
 IP.  What got really interesting is that near the end of the process we
 ran out of available IPs, requiring us to switch the final node that was on
 EC2 directly to the private VPC (and taking down two nodes at once, which
 our setup allowed given we had 6 nodes with an RF of 3).

 What we did, and highly suggest for the switch, is to write down every
 step that has to happen on every node during the switch.  In our case, many
 of the moved nodes required slightly different configurations for items
 like the seeds.

 It's been a couple of years, so my memory on this may be a little fuzzy :)

 -Mike

   --
  *From:* Aiman Parvaiz ai...@shift.com
 *To:* user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com
 *Sent:* Thursday, June 5, 2014 12:55 PM
 *Subject:* Re: VPC AWS

 Michael,
 Thanks for the response, I am about to head into something very similar,
 if not exactly the same. I envision things happening along the same lines as
 you mentioned.
 I would be grateful if you could please throw some more light on how you
 went about switching Cassandra nodes from the public subnet to private without
 any downtime.
 I have not started on this project yet, I am still in my research phase. I plan
 to have an EC2 + public VPC cluster and then decommission the EC2 nodes to have
 everything in the public subnet; the next step would be to move it to the
 private subnet.

 Thanks


 On Thu, Jun 5, 2014 at 8:14 AM, Michael Theroux mthero...@yahoo.com
 wrote:

 We personally use the EC2Snitch, however, we don't have the multi-region
 requirements you do,

 -Mike

   --
  *From:* Alain RODRIGUEZ arodr...@gmail.com
 *To:* user@cassandra.apache.org
 *Sent:* Thursday, June 5, 2014 9:14 AM
 *Subject:* Re: VPC AWS

 I think you can define VPC subnet to be public (to have public + private
 IPs) or private only.

 Any insight regarding snitches ? What snitch do you guys use ?


 2014-06-05 15:06 GMT+02:00 William Oberman ober...@civicscience.com:

 I don't think traffic will flow between classic ec2 and vpc directly.
 There is some kind of gateway bridge instance that sits between, acting as
 a NAT.   I would think that would cause new challenges for:
 -transitions
 -clients

 Sorry this response isn't heavy on content!  I'm curious how this thread
 goes...

 Will

 On Thursday, June 5, 2014, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi guys,

 We are going to move from a cluster made of simple Amazon EC2 servers to a
 VPC cluster. We are using Cassandra 1.2.11 and I have some

Re: High performance hardware with lot of data per node - Global learning about configuration

2013-07-11 Thread Aiman Parvaiz
Hi,
We also recently migrated to 3 hi1.4xlarge boxes (RAID0 SSD) and the disk IO 
performance is definitely better than on the earlier non-SSD servers; we are 
serving up to 14k reads/s with a latency of 3-3.5 ms/op. 
I wanted to share our config options and ask about the data backup strategy 
for RAID0.

We are using C* 1.2.6 with

key_cache and row_cache of 300MB.
I have not changed/modified any other parameter except for going with 
multithreaded GC. I will be playing around with other factors and update 
everyone if I find something interesting.

Also, I just wanted to share our backup strategy and see if I can get something 
useful from how others are taking backups of their RAID0. I am using tablesnap 
to upload SSTables to S3, and I have attached a separate EBS volume to every box 
and set up rsync to mirror the Cassandra data from RAID0 to EBS. I would 
really appreciate it if you guys could share how you are taking backups.
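Concretely the EBS mirror is not much more than this (the paths are ours, so
treat them as placeholders):

# mirror the live data directory on the RAID0 to the attached EBS volume
rsync -a --delete /var/lib/cassandra/data/ /mnt/ebs-backup/cassandra-data/
# optionally take a snapshot first and copy the snapshots/ directories instead,
# for a consistent point-in-time view
nodetool snapshot -t nightly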

Thanks 


On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi,
 
 Using C*1.2.2.
 
 We recently dropped our 18 m1.xLarge (4CPU, 15GB RAM, 4 Raid-0 Disks) servers 
 to get 3 hi1.4xLarge (16CPU, 60GB RAM, 2 Raid-0 SSD) servers instead, for 
 about the same price.
 
 We tried it after reading some benchmark published by Netflix.
 
 It is awesome and I recommend it to anyone who is using more than 18 xLarge 
 server or can afford these high cost / high performance EC2 instances. SSD 
 gives a very good throughput with an awesome latency.
 
 Yet, we had about 200 GB data per server and now about 1 TB.
 
 To alleviate memory pressure inside the heap I had to reduce the index 
 sampling. I changed the index_interval value from 128 to 512, with no visible 
 impact on latency, but a great improvement inside the heap which doesn't 
 complain about any pressure anymore.
 
 Is there some more tuning I could use, more tricks that could be useful while 
 using big servers, with a lot of data per node and relatively high throughput 
 ?
 
 SSD are at 20-40 % of their throughput capacity (according to OpsCenter), CPU 
 almost never reach a bigger load than 5 or 6 (with 16 CPU), 15 GB RAM used 
 out of 60GB.
 
 At this point I have kept my previous configuration, which is almost the 
 default one from the Datastax community AMI. There is a part of it, you can 
 consider that any property that is not in here is configured as default :
 
 cassandra.yaml
 
 key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88 % and 
 92 %, good enough ?)
 row_cache_size_in_mb: 0 (not usable in our use case, a lot of different and 
 random reads)
 flush_largest_memtables_at: 0.80
 reduce_cache_sizes_at: 0.90
 
 concurrent_reads: 32 (I am thinking to increase this to 64 or more since I 
 have just a few servers to handle more concurrence)
 concurrent_writes: 32 (I am thinking to increase this to 64 or more too)
 memtable_total_space_in_mb: 1024 (to avoid having a full heap, should I use a 
 bigger value, and why ?)
 
 rpc_server_type: sync (I tried hsha and had the ERROR 12:02:18,971 Read an 
 invalid frame size of 0. Are you using TFramedTransport on the client side? 
 error). No idea how to fix this, and I use 5 different clients for different 
 purpose  (Hector, Cassie, phpCassa, Astyanax, Helenus)...
 
 multithreaded_compaction: false (Should I try enabling this since I now use 
 SSD ?)
 compaction_throughput_mb_per_sec: 16 (I will definitely up this to 32 or even 
 more)
 
 cross_node_timeout: true
 endpoint_snitch: Ec2MultiRegionSnitch
 
 index_interval: 512
 
 cassandra-env.sh
 
 I am not sure about how to tune the heap, so I mainly use defaults
 
 MAX_HEAP_SIZE=8G
 HEAP_NEWSIZE=400M (I tried with higher values, and it produced bigger GC 
 times (1600 ms instead of  200 ms now with 400M)
 
 -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC
 -XX:+CMSParallelRemarkEnabled
 -XX:SurvivorRatio=8
 -XX:MaxTenuringThreshold=1
 -XX:CMSInitiatingOccupancyFraction=70
 -XX:+UseCMSInitiatingOccupancyOnly
 
 Does this configuration seem coherent ? Right now, performance is correct, 
 latency < 5ms almost all the time. What can I do to handle more data per node 
 and keep this performance, or get even better ?
 
 I know this is a long message but if you have any comment or insight even on 
 part of it, don't hesitate to share it. I guess this kind of comment on 
 configuration is usable by the entire community.
 
 Alain
 



Re: High performance hardware with lot of data per node - Global learning about configuration

2013-07-11 Thread Aiman Parvaiz
Thanks for the info Mike. We ran into a race condition which was killing 
tablesnap; I want to share the problem and the solution/workaround, and maybe 
someone can throw some light on the effects of the solution.

tablesnap was getting killed with this error message:

Failed uploading %s. Aborting.\n%s 

Looking at the code it took me to the following:

# excerpt from tablesnap's upload worker
def worker(self):
    bucket = self.get_bucket()

    while True:
        # Pull the next SSTable filename off the upload queue
        f = self.fileq.get()
        keyname = self.build_keyname(f)
        try:
            self.upload_sstable(bucket, keyname, f)
        except:
            self.log.critical("Failed uploading %s. Aborting.\n%s" %
                              (f, format_exc()))
            # Brute force kill self
            os.kill(os.getpid(), signal.SIGKILL)

        self.fileq.task_done()

It builds the filename and then, before it can upload it, the file disappears 
(which is possible). I simply commented out the line which kills tablesnap if 
the file is not found; that fixes the issue we were having, but I would 
appreciate it if someone has any insights on any ill effects this might have on 
the backup or restoration process.

Thanks


On Jul 11, 2013, at 7:03 AM, Mike Heffner m...@librato.com wrote:

 We've also noticed very good read and write latencies with the hi1.4xls 
 compared to our previous instance classes. We actually ran a mixed cluster of 
 hi1.4xls and m2.4xls to watch side-by-side comparison.
 
 Despite the significant improvement in underlying hardware, we've noticed 
 that streaming performance with 1.2.6+vnodes is a lot slower than we would 
 expect. Bootstrapping a node into a ring with large storage loads can take 6+ 
 hours. We have a JIRA open that describes our current config: 
 https://issues.apache.org/jira/browse/CASSANDRA-5726
 
 Aiman: We also use tablesnap for our backups. We're using a slightly modified 
 version [1]. We currently back up every sst as soon as they hit disk 
 (tablesnap's inotify), but we're considering moving to a periodic snapshot 
 approach as the sst churn after going from 24 nodes -> 6 nodes is quite high.
 
 Mike
 
 
 [1]: https://github.com/librato/tablesnap
 
 
 On Thu, Jul 11, 2013 at 7:33 AM, Aiman Parvaiz ai...@grapheffect.com wrote:
 Hi,
 We also recently migrated to 3 hi1.4xlarge boxes (RAID0 SSD) and the disk IO 
 performance is definitely better than on the earlier non-SSD servers; we are 
 serving up to 14k reads/s with a latency of 3-3.5 ms/op.
 I wanted to share our config options and ask about the data backup strategy 
 for RAID0.
 
 We are using C* 1.2.6 with
 
 key_cache and row_cache of 300MB.
 I have not changed/modified any other parameter except for going with 
 multithreaded GC. I will be playing around with other factors and update 
 everyone if I find something interesting.
 
 Also, I just wanted to share our backup strategy and see if I can get something 
 useful from how others are taking backups of their RAID0. I am using tablesnap 
 to upload SSTables to S3, and I have attached a separate EBS volume to every 
 box and set up rsync to mirror the Cassandra data from RAID0 to EBS. I would 
 really appreciate it if you guys could share how you are taking backups.
 
 Thanks
 
 
 On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 
  Hi,
 
  Using C*1.2.2.
 
  We recently dropped our 18 m1.xLarge (4CPU, 15GB RAM, 4 Raid-0 Disks) 
  servers to get 3 hi1.4xLarge (16CPU, 60GB RAM, 2 Raid-0 SSD) servers 
  instead, for about the same price.
 
  We tried it after reading some benchmark published by Netflix.
 
  It is awesome and I recommend it to anyone who is using more than 18 xLarge 
  server or can afford these high cost / high performance EC2 instances. SSD 
  gives a very good throughput with an awesome latency.
 
  Yet, we had about 200 GB data per server and now about 1 TB.
 
  To alleviate memory pressure inside the heap I had to reduce the index 
  sampling. I changed the index_interval value from 128 to 512, with no 
  visible impact on latency, but a great improvement inside the heap which 
  doesn't complain about any pressure anymore.
 
  Is there some more tuning I could use, more tricks that could be useful 
  while using big servers, with a lot of data per node and relatively high 
  throughput ?
 
  SSD are at 20-40 % of their throughput capacity (according to OpsCenter), 
  CPU almost never reach a bigger load than 5 or 6 (with 16 CPU), 15 GB RAM 
  used out of 60GB.
 
  At this point I have kept my previous configuration, which is almost the 
  default one from the Datastax community AMI. There is a part of it, you can 
  consider that any property that is not in here is configured as default :
 
  cassandra.yaml
 
  key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88 % 
  and 92 %, good enough ?)
  row_cache_size_in_mb: 0 (not usable in our use case, a lot of different and 
  random reads)
  flush_largest_memtables_at: 0.80
  reduce_cache_sizes_at: 0.90

Populating seeds dynamically

2013-06-03 Thread Aiman Parvaiz
Hi all
I am using puppet to push the cassandra.yaml file, which currently has the seed 
nodes hardcoded. Going forward I don't want to hard code the seed nodes; I plan 
to maintain a list of seed nodes instead. Since I already have a cluster in place 
I would populate this list now to start with, and the next time I add a node this 
list would be consulted: three nodes would be read from it and populated as seeds 
in the yaml file.

This implementation can lead to different nodes running with different seeds. I 
know that this is not an ideal situation, but I believe that if a node has been 
in the ring for long enough (say 10 minutes, so it knows about the other nodes 
in the ring) then it can be used as a seed node.

What do you guys think of populating seeds this way? Also, please throw some 
light on why running different seeds is not a best practice (assuming that all 
potential seed candidates have been in the ring long enough)

Thanks

Re: Populating seeds dynamically

2013-06-03 Thread Aiman Parvaiz
@Faraaz check out the comment by Aaron Morton here: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Seed-Nodes-td6077958.html
Having the same seeds on all nodes is a good idea, but it is not necessary.
 In your case, sure the nodes will be in the cluster for 10
 minutes but what about sporadic failures that cause them to leave the ring and
 then re-enter it? At that point, you might reach the network fragmentation
 issue.

I am not sure I understand this completely. If a node leaves the ring and 
re-enters, it would use its seed nodes to learn about the ring, and those seeds 
would be nodes which are already part of the ring, so I don't see any 
information lag happening here.

On Jun 3, 2013, at 5:06 PM, Faraaz Sareshwala fsareshw...@quantcast.com wrote:

 All the documentation that I have read about cassandra always says to keep the
 same list of seeds on every node in the cluster. Without this, you can end up
 with fragmentation within your cluster where nodes don't know about other 
 nodes
 in the cluster. In your case, sure the nodes will be in the cluster for 10
 minutes but what about sporadic failures that cause them to leave the ring and
 then re-enter it? At that point, you might reach the network fragmentation
 issue.
 
 I also use puppet to push out the cassandra.yaml file. I've defined the list 
 of
 seeds in my puppet class and have puppet generate the cassandra.yaml file from
 an erb template.
 
 Hopefully that helps a bit :).
 
 Faraaz
 
 On Mon, Jun 03, 2013 at 04:59:23PM -0700, Aiman Parvaiz wrote:
 Hi all
 I am using puppet to push the cassandra.yaml file, which currently has the 
 seed nodes hardcoded. Going forward I don't want to hard code the seed nodes; 
 I plan to maintain a list of seed nodes instead. Since I already have a 
 cluster in place I would populate this list now to start with, and the next 
 time I add a node this list would be consulted: three nodes would be read 
 from it and populated as seeds in the yaml file.

 This implementation can lead to different nodes running with different seeds. 
 I know that this is not an ideal situation, but I believe that if a node has 
 been in the ring for long enough (say 10 minutes, so it knows about the other 
 nodes in the ring) then it can be used as a seed node.

 What do you guys think of populating seeds this way? Also, please throw some 
 light on why running different seeds is not a best practice (assuming that 
 all potential seed candidates have been in the ring long enough)
 
 Thanks



Re: Cassandra performance decreases drastically with increase in data size.

2013-05-31 Thread Aiman Parvaiz
I believe you should roll out more nodes as a temporary fix to your problem: 
400GB on all nodes means (as correctly mentioned in other mails of this thread) 
you are spending more time on GC. Check out the second comment in this link by 
Aaron Morton, where he says that more than 300GB per node can be problematic. 
This post is about an older version of Cassandra, but I believe the concept 
still holds true:

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-it-safe-to-stop-a-read-repair-and-any-suggestion-on-speeding-up-repairs-td6607367.html

Thanks

On May 29, 2013, at 9:32 PM, srmore comom...@gmail.com wrote:

 Hello,
 I am observing that my performance is drastically decreasing when my data 
 size grows. I have a 3 node cluster with 64 GB of ram and my data size is 
 around 400GB on all the nodes. I also see that when I re-start Cassandra the 
 performance goes back to normal and then again starts decreasing after some 
 time. 
 
 Some hunting landed me to this page 
 http://wiki.apache.org/cassandra/LargeDataSetConsiderations which talks about 
 the large data sets and explains that it might be because I am going through 
 multiple layers of OS cache, but does not tell me how to tune it.
 
 So, my question is, are there any optimizations that I can do to handle these 
 large datasets ?
 
 and why does my performance go back to normal when I restart Cassandra ?
 
 Thanks !



Re: Cassandra running High Load with no one using the cluster

2013-05-06 Thread Aiman Parvaiz
Correction: there was a typo in my original question, we are running Cassandra 
1.1.10.

Thanks and sorry for the inconvenience.
On May 6, 2013, at 10:23 AM, Robert Coli rc...@eventbrite.com wrote:

 including non-working Hinted Handoff



Cassandra running High Load with no one using the cluster

2013-05-04 Thread Aiman Parvaiz
Since last night I am seeing CPU load spikes on our Cassandra
boxes (occasionally the load goes up to 20; it's an Amazon EC2 c1.xlarge with 300
IOPS EBS). After digging around a little I believe it is related to heap
memory and flushing memtables.

From logs:
WARN 03:22:03,414 Heap is 0.7786981388910019 full.  You may need to reduce
memtable and/or cache sizes.  Cassandra will now flush up to the two
largest memtables to free up memory.  Adjust flush_largest_memtables_at
threshold in cassandra.yaml if you don't want Cassandra to do this
automatically

WARN 03:22:03,415 Flushing CFS(Keyspace='XXX', ColumnFamily='') to
relieve memory pressure

I have three nodes and only 2 of them are hitting this high load, moreover
cluster is under extremely light load, no one is using it since yesterday
and I still see this load.

I also observed that `top -H` showed many threads in the Sleep state and only a
handful in the R state. `nodetool cfstats` showed the following for the
ColumnFamily in the Cassandra log above:

  Column Family: 
SSTable count: 8
Space used (live): 1479005837
Space used (total): 1479005837
Number of Keys (estimate): 2923008
Memtable Columns Count: 35375
Memtable Data Size: 7088479
Memtable Switch Count: 2393
Read Count: 2339668632
Read Latency: 3.042 ms.
Write Count: 360448535
Write Latency: 0.079 ms.
Pending Tasks: 0
Bloom Filter False Positives: 143197
Bloom Filter False Ratio: 0.73004
Bloom Filter Space Used: 7142048
Compacted row minimum size: 73
Compacted row maximum size: **785939**
Compacted row mean size: 1957

`Compacted row maximum size` for the other ColumnFamilies is significantly less
than this number.
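To narrow this down I am planning to watch the heap and the row sizes directly
while the load is high, roughly:

nodetool info | grep -i heap                        # current vs max heap usage
nodetool tpstats                                    # pending/blocked flush and compaction stages
nodetool cfhistograms <keyspace> <column_family>    # row size / column count distribution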

When starting this cluster we set
 JVM_OPTS="$JVM_OPTS -Xss1000k"

We are using cassandra 1.1.0 and open-6-jdk

Can anyone please help me understand why, with no load on the system, I am
still seeing such high load on my machines.

Thanks