Re: Nodetool ring and Replicas after 1.2 upgrade

2015-06-16 Thread Michael Theroux
Thanks Jason,
No errors in the log.  Also the nodes do have a consistent schema for the 
keyspace (although this was a problem during the upgrade that we resolved using 
the procedure specified here: 
https://wiki.apache.org/cassandra/FAQ#schema_disagreement).  
-Mike  
  From: Jason Wee peich...@gmail.com
 To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com 
 Sent: Tuesday, June 16, 2015 12:07 AM
 Subject: Re: Nodetool ring and Replicas after 1.2 upgrade
   
maybe check the system.log to see if there is any exception and/or error? check 
as well if they are having consistent schema for the keyspace?
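
For example, something like this (the log path assumes a package install, adjust to your setup):

grep -iE 'ERROR|Exception' /var/log/cassandra/system.log | tail -50
nodetool describecluster    # every node should report the same schema version
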
hth
jason


On Tue, Jun 16, 2015 at 7:17 AM, Michael Theroux mthero...@yahoo.com wrote:

Hello,
We (finally) have just upgraded from Cassandra 1.1 to Cassandra 1.2.19.  
Everything appears to be up and running normally, however, we have noticed 
unusual output from nodetool ring.  There is a new (to us) field Replicas in 
the nodetool output, and this field, seemingly at random, is changing from 2 to 
3 and back to 2.
We are using the byte ordered partitioner (we hash our own keys), and have a 
replication factor of 3.  We are also on AWS and utilize the Ec2snitch on a 
single Datacenter.  
Other calls appear to be normal.  nodetool getEndpoints returns the proper 
endpoints when querying various keys, nodetool ring and status return that all 
nodes appear healthy.  
Anyone have any hints on what may be happening, or if this is a problem we 
should be concerned with?
Thanks,-Mike




  

Re: Nodetool ring and Replicas after 1.2 upgrade

2015-06-16 Thread Michael Theroux
After looking at the cassandra code a little, I believe this is not really an 
issue.
After the upgrade to 1.2, we still see the issue described in this bug I filed:
https://issues.apache.org/jira/browse/CASSANDRA-5264

The Replicas value is calculated by adding up the effective ownership of all the 
nodes and chopping off the remainder.  So if your effective ownership is 
299.99%, it appears the code will report the number of replicas as 2.  
This might become reliably 3 after I complete running repairs after the upgrade.
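
A rough illustration of that truncation (Python sketch only, not the actual nodetool code):

ownership_pct = [50.00, 50.00, 50.00, 50.00, 50.00, 49.99]  # per-node effective ownership from nodetool ring
total = sum(ownership_pct)           # 299.99
replicas = int(total // 100)         # 2 -- the fractional remainder is chopped off
# once ownership sums to a full 300.00, the same calculation yields 3
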
Thanks for your time,-Mike
  From: Alain RODRIGUEZ arodr...@gmail.com
 To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com 
 Sent: Tuesday, June 16, 2015 4:43 PM
 Subject: Re: Nodetool ring and Replicas after 1.2 upgrade
   
Hi Michael,
I can barely access the internet right now and was not able to check outputs on my 
computer, yet the first thing that comes to my mind is that since 1.2.x (and vnodes) 
I rather use nodetool status instead. What is the nodetool status output?
Also did you try to specify the keyspace ? Since RF is a per keyspace value, 
maybe this would help.
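
For example (keyspace name is a placeholder; the per-keyspace output reflects that keyspace's RF):

nodetool status my_keyspace
nodetool ring my_keyspace
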
Other than that, I don't have any idea. I don't remember anything similar, but 
it was a while ago. I have to ask... Why stay so far behind the current 
stable / production-ready version?
C*heers,
Alain
2015-06-16 14:57 GMT+02:00 Michael Theroux mthero...@yahoo.com:



Thanks Jason,
No errors in the log.  Also the nodes do have a consistent schema for the 
keyspace (although this was a problem during the upgrade that we resolved using 
the procedure specified here: 
https://wiki.apache.org/cassandra/FAQ#schema_disagreement).  
-Mike  
  From: Jason Wee peich...@gmail.com
 To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com 
 Sent: Tuesday, June 16, 2015 12:07 AM
 Subject: Re: Nodetool ring and Replicas after 1.2 upgrade
   
maybe check the system.log to see if there is any exception and/or error? check 
as well if they are having consistent schema for the keyspace?
hth
jason


On Tue, Jun 16, 2015 at 7:17 AM, Michael Theroux mthero...@yahoo.com wrote:

Hello,
We (finally) have just upgraded from Cassandra 1.1 to Cassandra 1.2.19.  
Everything appears to be up and running normally, however, we have noticed 
unusual output from nodetool ring.  There is a new (to us) field Replicas in 
the nodetool output, and this field, seemingly at random, is changing from 2 to 
3 and back to 2.
We are using the byte ordered partitioner (we hash our own keys), and have a 
replication factor of 3.  We are also on AWS and utilize the Ec2snitch on a 
single Datacenter.  
Other calls appear to be normal.  nodetool getEndpoints returns the proper 
endpoints when querying various keys, nodetool ring and status return that all 
nodes appear healthy.  
Anyone have any hints on what may be happening, or if this is a problem we 
should be concerned with?
Thanks,-Mike




   



  

Nodetool ring and Replicas after 1.2 upgrade

2015-06-15 Thread Michael Theroux
Hello,
We (finally) have just upgraded from Cassandra 1.1 to Cassandra 1.2.19.  
Everything appears to be up and running normally, however, we have noticed 
unusual output from nodetool ring.  There is a new (to us) field Replicas in 
the nodetool output, and this field, seemingly at random, is changing from 2 to 
3 and back to 2.
We are using the byte ordered partitioner (we hash our own keys), and have a 
replication factor of 3.  We are also on AWS and utilize the Ec2snitch on a 
single Datacenter.  
Other calls appear to be normal.  nodetool getEndpoints returns the proper 
endpoints when querying various keys, nodetool ring and status return that all 
nodes appear healthy.  
Anyone have any hints on what may be happening, or if this is a problem we 
should be concerned with?
Thanks,-Mike


Re: VPC AWS

2014-06-05 Thread Michael Theroux
Hello Alain,

We switched from EC2 to VPC a couple of years ago.  The process for us was 
long, slow, and multi-step for our (at the time) 6-node cluster.

In our case, we don't need to consider multi-DC.  However, in our 
infrastructure we were rapidly running out of IP addresses, and wished to move 
to VPC to give us a nearly inexhaustible supply.  In addition, AWS VPC gives us 
an additional layer of security for our Cassandra cluster. 

To do this, we setup our VPC to have both private and public subnets.  Public 
subnets were accessible to the Internet (when instances were assigned a public 
IP), while private subnets could not (although instances on the subnet could 
access the Internet via a NAT instance).  We wished for Cassandra to be on the 
private subnet.  However, this introduced a complication: EC2 instances would 
not be able to communicate directly with our VPC instances on a private subnet. 

So, to achieve this, while still having an operating Cassandra DB without 
downtime, we essentially had to stage Cassandra instances on our public subnet, 
assigning IPs and reconfiguring nodes until we had a mixed EC2/VPC Public 
subnet cluster, then start moving systems to the private subnet, continuing the 
process until all instances were on a private subnet.  During the process we 
carefully orchestrated configuration like broadcast and seeds to make sure the 
cluster continued to function properly and all nodes could communicate with 
each other.  We also had to carefully orchestrate the assigning of AWS security 
groups to make sure everyone could talk to each other during this process.
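
As a purely illustrative example (all addresses and seed lists below are placeholders, not our actual values), a node staged in the public VPC subnet during that phase was configured roughly along these lines in cassandra.yaml:

listen_address: 10.0.1.15            # the instance's private VPC IP
broadcast_address: 203.0.113.10      # its public (Elastic) IP, reachable from classic EC2
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "203.0.113.11,203.0.113.12"   # public IPs of seeds reachable from both sides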

Also keep in mind that the use of public IPs for communications will add to 
your AWS costs.  During our transition we had to do this for a short time while 
EC2 instances were communicating with VPC instances, but we were able to switch 
to 100% internal IPs when we completed (you will still get inter-availability-zone 
charges regardless).

This process was complex enough that I wrote a detailed series of steps for each 
node in our cluster.

-Mike
 


 From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org 
Sent: Thursday, June 5, 2014 8:12 AM
Subject: VPC AWS
 


Hi guys,

We are going to move from a cluster made of simple Amazon EC2 servers to a VPC 
cluster. We are using Cassandra 1.2.11 and I have some questions regarding this 
switch and the Cassandra configuration inside a VPC.

Actually I found no documentation on this topic, but I am quite sure that some 
people are already using VPC. If you can point me to any documentation 
regarding VPC / Cassandra, it would be very nice of you. We have only one DC 
for now, but we need to remain multi DC compatible, since we will add DC very 
soon.

Else, I would like to know if I should keep using EC2MultiRegionSnitch or 
change the snitch to anything else.

What about broadcast/listen ip, seeds...?

We currently use public ip as for broadcast address and for seeds. We use 
private ones for listen address. Machines inside the VPC will only have private 
IP AFAIK. Should I keep using a broadcast address ?

Is there any other incidence when switching to a VPC ?

Sorry if the topic was already discussed, I was unable to find any useful 
information...

Re: VPC AWS

2014-06-05 Thread Michael Theroux
Hello Alain,

We switched from EC2 to VPC a couple of years ago.  The process for us was 
long, slow and multi step.

In our case, we don't need to consider multi-DC.  However, in our 
infrastructure we were rapidly running out of IP addresses, and wished to move 
to VPC to give us a nearly inexhaustible supply.  In addition, AWS VPC gives us 
an additional layer of security for our Cassandra cluster. 

To do this, we setup our VPC to have both private and public subnets.  Public 
subnets were accessible to the Internet (when instances were assigned a public 
IP), while private subnets could not (although instances on the subnet could 
access the Internet via a NAT instance).  We wished for Cassandra on the 
private subnet. 

So, to achieve this, while still having an operating Cassandra DB without 
downtime, we essentially had to stage Cassandra instances on our public subnet, 
assigning IPs and reconfiguring nodes until we had a mixed EC2/VPC Public 
cluster, then start moving systems to the private subnet, continuing the 
process until all instances were on a private subnet.  During the process we 
carefully orchestrated configuration like broadcast and seeds to make sure the 
cluster continued to function properly.  We also had to orchestrate the 
assigning of AWS security groups to make sure everyone could talk to each other 
during this process.

Also keep in mind that the use of public IPs for communications will add to 
your AWS costs.  During our transition we had to do this for a short time while 
EC2 instances were communicating with VPC instances, but we were able to switch 
to 100% internal IPs when we completed (you will still get inter availability 
zone charges regardless)

In order to make this successful, I created a script out


 From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org 
Sent: Thursday, June 5, 2014 8:12 AM
Subject: VPC AWS
 


Hi guys,

We are going to move from a cluster made of simple Amazon EC2 servers to a VPC 
cluster. We are using Cassandra 1.2.11 and I have some questions regarding this 
switch and the Cassandra configuration inside a VPC.

Actually I found no documentation on this topic, but I am quite sure that some 
people are already using VPC. If you can point me to any documentation 
regarding VPC / Cassandra, it would be very nice of you. We have only one DC 
for now, but we need to remain multi DC compatible, since we will add DC very 
soon.

Else, I would like to know if I should keep using EC2MultiRegionSnitch or 
change the snitch to anything else.

What about broadcast/listen ip, seeds...?

We currently use public ip as for broadcast address and for seeds. We use 
private ones for listen address. Machines inside the VPC will only have private 
IP AFAIK. Should I keep using a broadcast address ?

Is there any other incidence when switching to a VPC ?

Sorry if the topic was already discussed, I was unable to find any useful 
information...

Re: VPC AWS

2014-06-05 Thread Michael Theroux
We personally use the EC2Snitch, however, we don't have the multi-region 
requirements you do,

-Mike



 From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org 
Sent: Thursday, June 5, 2014 9:14 AM
Subject: Re: VPC AWS
 


I think you can define VPC subnet to be public (to have public + private IPs) 
or private only.

Any insight regarding snitches ? What snitch do you guys use ?



2014-06-05 15:06 GMT+02:00 William Oberman ober...@civicscience.com:

I don't think traffic will flow between classic ec2 and vpc directly. There 
is some kind of gateway bridge instance that sits between, acting as a NAT.   I 
would think that would cause new challenges for:

-transitions 
-clients

Sorry this response isn't heavy on content!  I'm curious how this thread 
goes...


Will


On Thursday, June 5, 2014, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi guys,


We are going to move from a cluster made of simple Amazon EC2 servers to a 
VPC cluster. We are using Cassandra 1.2.11 and I have some questions 
regarding this switch and the Cassandra configuration inside a VPC.


Actually I found no documentation on this topic, but I am quite sure that 
some people are already using VPC. If you can point me to any documentation 
regarding VPC / Cassandra, it would be very nice of you. We have only one DC 
for now, but we need to remain multi DC compatible, since we will add DC very 
soon.


Else, I would like to know if I should keep using EC2MultiRegionSnitch or 
change the snitch to anything else.


What about broadcast/listen ip, seeds...?


We currently use public ip as for broadcast address and for seeds. We use 
private ones for listen address. Machines inside the VPC will only have 
private IP AFAIK. Should I keep using a broadcast address ?


Is there any other incidence when switching to a VPC ?


Sorry if the topic was already discussed, I was unable to find any useful 
information...

-- 
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) ober...@civicscience.com


Re: VPC AWS

2014-06-05 Thread Michael Theroux
The implementation of moving from EC2 to a VPC was a bit of a juggling act.  
Our motivation was two fold:


1) We were running out of static IP addresses, and it was becoming increasingly 
difficult in EC2 to design around limiting the number of static IP addresses to 
the number of public IP addresses EC2 allowed
2) VPC affords us an additional level of security that was desirable.

However, we needed to consider the following limitations:

1) By default, you have a limited number of available public IPs for both EC2 
and VPC.  
2) AWS security groups need to be configured to allow traffic for Cassandra 
to/from instances in EC2 and the VPC.

You are correct at the high level that the migration goes from EC2 -> Public VPC 
(VPC with an Internet Gateway) -> Private VPC (VPC with a NAT).  The first phase 
was moving instances to the public VPC, setting broadcast and seeds to the 
public IPs we had available.  Basically:

1) Take down a node, taking a snapshot for a backup

2) Restore the node on the public VPC, assigning it to the correct security 
group, manually setting the seeds to other available nodes
3) Verify the cluster can communicate
4) Repeat
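
A hedged sketch of what one iteration of steps 1-3 looked like (generic commands and paths, not our exact runbook):

# step 1, on the EC2 node being replaced
nodetool drain                        # flush memtables and stop accepting writes
nodetool snapshot -t pre_vpc_move     # hard-linked snapshot used as the backup
sudo service cassandra stop
# ... copy the data directories / snapshot to the new instance in the public VPC ...

# step 2, on the replacement VPC instance (launched in the right security group,
# cassandra.yaml edited with the seeds and broadcast address for this phase)
sudo service cassandra start

# step 3, from any node
nodetool ring                         # verify every node shows up and is Up/Normal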

Realize the NAT instance on the private subnet will also require a public IP.  
What got really interesting is that near the end of the process we ran out of 
available IPs, requiring us to switch the final node that was on EC2 directly 
to the private VPC (and taking down two nodes at once, which our setup allowed 
given we had 6 nodes with an RF of 3).  

What we did, and highly suggest for the switch, is to write down every step 
that has to happen on every node during the switch.  In our case, many of the 
moved nodes required slightly different configurations for items like the seeds.

It's been a couple of years, so my memory on this may be a little fuzzy :)

-Mike



 From: Aiman Parvaiz ai...@shift.com
To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com 
Sent: Thursday, June 5, 2014 12:55 PM
Subject: Re: VPC AWS
 


Michael, 
Thanks for the response, I am about to head into something very similar if not 
exactly the same. I envision things happening along the same lines as you mentioned. 
I would be grateful if you could please throw some more light on how you went 
about switching cassandra nodes from the public subnet to private without any 
downtime.
I have not started on this project yet, still in my research phase. I plan to 
have an EC2 + public-VPC cluster and then decommission the EC2 nodes to have everything 
in the public subnet; next would be to move it to the private subnet.

Thanks



On Thu, Jun 5, 2014 at 8:14 AM, Michael Theroux mthero...@yahoo.com wrote:

We personally use the EC2Snitch, however, we don't have the multi-region 
requirements you do,


-Mike




 
From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org 
Sent: Thursday, June 5, 2014 9:14 AM
Subject: Re: VPC AWS



I think you can define VPC subnet to be public (to have public + private IPs) 
or private only.


Any insight regarding snitches ? What snitch do you guys use ?



2014-06-05 15:06 GMT+02:00 William Oberman ober...@civicscience.com:

I don't think traffic will flow between classic ec2 and vpc directly. There 
is some kind of gateway bridge instance that sits between, acting as a NAT.   
I would think that would cause new challenges for:

-transitions 
-clients

Sorry this response isn't heavy on content!  I'm curious how this thread 
goes...


Will


On Thursday, June 5, 2014, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi guys,


We are going to move from a cluster made of simple Amazon EC2 servers to a 
VPC cluster. We are using Cassandra 1.2.11 and I have some questions 
regarding this switch and the Cassandra configuration inside a VPC.


Actually I found no documentation on this topic, but I am quite sure that 
some people are already using VPC. If you can point me to any documentation 
regarding VPC / Cassandra, it would be very nice of you. We have only one DC 
for now, but we need to remain multi DC compatible, since we will add DC 
very soon.


Else, I would like to know if I should keep using EC2MultiRegionSnitch or 
change the snitch to anything else.


What about broadcast/listen ip, seeds...?


We currently use public ip as for broadcast address and for seeds. We use 
private ones for listen address. Machines inside the VPC will only have 
private IP AFAIK. Should I keep using a broadcast address ?


Is there any other incidence when switching to a VPC ?


Sorry if the topic was already discussed, I was unable to find any useful 
information...

-- 
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) ober...@civicscience.com





Re: VPC AWS

2014-06-05 Thread Michael Theroux
You can have a ring spread across EC2 and the public subnet of a VPC.  That is 
how we did our migration.  In our case, we simply replaced the existing EC2 
node with a new instance in the public VPC, restored from a backup taken right 
before the switch.

-Mike



 From: Aiman Parvaiz ai...@shift.com
To: Michael Theroux mthero...@yahoo.com 
Cc: user@cassandra.apache.org user@cassandra.apache.org 
Sent: Thursday, June 5, 2014 2:39 PM
Subject: Re: VPC AWS
 


Thanks for this info Michael. As far as restoring a node in the public VPC is 
concerned, I was thinking (and I might be wrong here) that we can have a ring 
spread across EC2 and the public subnet of a VPC; this way I can simply 
decommission nodes in EC2 as I gradually introduce new nodes in the public subnet 
of the VPC, and I will end up with a ring in the public subnet and can then migrate it 
from public to private in a similar way, maybe.

If anyone has any experience/ suggestions with this please share, would really 
appreciate it.

Aiman



On Thu, Jun 5, 2014 at 10:37 AM, Michael Theroux mthero...@yahoo.com wrote:

The implementation of moving from EC2 to a VPC was a bit of a juggling act.  
Our motivation was two fold:



1) We were running out of static IP addresses, and it was becoming 
increasingly difficult in EC2 to design around limiting the number of static 
IP addresses to the number of public IP addresses EC2 allowed
2) VPC affords us an additional level of security that was desirable.


However, we needed to consider the following limitations:


1) By default, you have a limited number of available public IPs for both EC2 
and VPC.  
2) AWS security groups need to be configured to allow traffic for Cassandra 
to/from instances in EC2 and the VPC.


You are correct at the high level that the migration goes from EC2 -> Public VPC 
(VPC with an Internet Gateway) -> Private VPC (VPC with a NAT).  The first phase 
was moving instances to the public VPC, setting broadcast and seeds to the 
public IPs we had available.  Basically:


1) Take down a node, taking a snapshot for a backup

2) Restore the node on the public VPC, assigning it to the correct security 
group, manually setting the seeds to other available nodes
3) Verify the cluster can communicate
4) Repeat


Realize the NAT instance on the private subnet will also require a public IP.  
What got really interesting is that near the end of the process we ran out of 
available IPs, requiring us to switch the final node that was on EC2 directly 
to the private VPC (and taking down two nodes at once, which our setup allowed 
given we had 6 nodes with an RF of 3).  


What we did, and highly suggest for the switch, is to write down every step 
that has to happen on every node during the switch.  In our case, many of the 
moved nodes required slightly different configurations for items like the 
seeds.


 It's been a couple of years, so my memory on this may be a little fuzzy :)


-Mike




 From: Aiman Parvaiz ai...@shift.com
To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com 
Sent: Thursday, June 5, 2014 12:55 PM
Subject: Re: VPC AWS
 


Michael, 
Thanks for the response, I am about to head in to something very similar if 
not exactly same. I envision things happening on the same lines as you 
mentioned. 
I would be grateful if you could please throw some more light on how you went 
about switching cassandra nodes from public subnet to private with out any 
downtime.
I have not started on this project yet, still in my research phase. I plan to 
have a ec2+public VPC cluster and then decomission ec2 nodes to have 
everything in public subnet, next would be to move it to private subnet.


Thanks



On Thu, Jun 5, 2014 at 8:14 AM, Michael Theroux mthero...@yahoo.com wrote:

We personally use the EC2Snitch, however, we don't have the multi-region 
requirements you do,


-Mike




 
From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org 
Sent: Thursday, June 5, 2014 9:14 AM
Subject: Re: VPC AWS



I think you can define VPC subnet to be public (to have public + private IPs) 
or private only.


Any insight regarding snitches ? What snitch do you guys use ?



2014-06-05 15:06 GMT+02:00 William Oberman ober...@civicscience.com:

I don't think traffic will flow between classic ec2 and vpc directly. There 
is some kind of gateway bridge instance that sits between, acting as a NAT.   
I would think that would cause new challenges for:

-transitions 
-clients

Sorry this response isn't heavy on content!  I'm curious how this thread 
goes...


Will


On Thursday, June 5, 2014, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi guys,


We are going to move from a cluster made of simple Amazon EC2 servers to a 
VPC cluster. We are using Cassandra 1.2.11 and I have some questions 
regarding this switch and the Cassandra configuration inside a VPC.


Actually I found no documentation on this topic, but I am quite

Re: cassandra backup

2013-12-06 Thread Michael Theroux
Hi Marcelo,

Cassandra provides an eventually consistent model for backups.  You can do 
staggered backups of data, with the idea that if you restore a node, and then 
do a repair, your data will be once again consistent.  Cassandra will not 
automatically copy the data to other nodes (other than via hinted handoff).  
You should manually run repair after restoring a node.
  
You should take snapshots when doing a backup, as it keeps the data you are 
backing up relevant to a single point in time; otherwise compaction could 
add or delete files on you mid-backup, or worse, I imagine, attempt to access an 
SSTable mid-write.  Snapshots work by using hard links, and don't take additional 
storage to perform.  In our process we create the snapshot, perform the backup, 
and then clear the snapshot.
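
A minimal sketch of that snapshot / backup / clear-snapshot cycle (bucket name, keyspace and paths are assumptions, and the aws CLI is just one way to do the copy):

TAG=backup_$(date +%Y%m%d)
nodetool snapshot -t "$TAG" my_keyspace              # hard links only, no extra space
aws s3 sync /var/lib/cassandra/data/my_keyspace \
    s3://my-backup-bucket/$(hostname)/"$TAG"/ \
    --exclude '*' --include "*/snapshots/$TAG/*"     # ship just this snapshot's files
nodetool clearsnapshot my_keyspace                   # drop local snapshots once uploaded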

One thing to keep in mind in your S3 cost analysis is that, even though storage 
is cheap, reads/writes to S3 are not (especially writes).  If you are using 
LeveledCompaction, or otherwise have a ton of SSTables, some people have 
encountered increased costs moving the data to S3.

Ourselves, we maintain backup EBS volumes that we regularly snapshot/rsync data 
to.  Thus far this has worked very well for us.

-Mike



On Friday, December 6, 2013 8:14 AM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:
 
Hello everyone,

    I am trying to create backups of my data on AWS. My goal is to store the 
backups on S3 or glacier, as it's cheap to store this kind of data. So, if I 
have a cluster with N nodes, I would like to copy data from all N nodes to S3 
and be able to restore later. I know Priam does that (we were using it), but I 
am using the latest cassandra version and we plan to use DSE some time, I am 
not sure Priam fits this case.
    I took a look at the docs: 
http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/operations/../../cassandra/operations/ops_backup_takes_snapshot_t.html
 
    And I am trying to understand if it's really needed to take a snapshot to 
create my backup. Suppose I do a flush and copy the sstables from each node, one 
by one, to S3. Not all at the same time, but one by one. 
    When I try to restore my backup, data from node 1 will be older than data 
from node 2. Will this cause problems? AFAIK, if I am using a replication 
factor of 2, for instance, and Cassandra sees data from node X only, it will 
automatically copy it to other nodes, right? Is there any chance of cassandra 
nodes becoming corrupt somehow if I do my backups this way?

Best regards,
Marcelo Valle.

Re: Compaction issues

2013-10-23 Thread Michael Theroux
One more note,

When we did this conversion, we were on Cassandra 1.1.X.  You didn't mention 
what version of Cassandra you were running,

Thanks,
-Mike

On Oct 23, 2013, at 10:05 AM, Michael Theroux wrote:

 When we made a similar move, for an unknown reason (I didn't hear any 
 feedback from the list when I asked why this might be), compaction didn't 
 start after we moved from SizeTiered to leveled compaction until I ran 
 nodetool compact keyspace column-family-converted-to-lcs.
 
 The thread is here:
 
 http://www.mail-archive.com/user@cassandra.apache.org/msg27726.html
 
 I've also seen other individuals on this list state that those pending 
 compaction stats didn't move unless the node was restarted.  Compaction 
 started to run several minutes after restart.
 
 Thanks,
 -Mike
 
 On Oct 23, 2013, at 9:14 AM, Russ Garrett wrote:
 
 Hi,
 
 We have a cluster which we've recently moved to use
 LeveledCompactionStrategy. We were experiencing some disk space
 issues, so we added two additional nodes temporarily to aid
 compaction. Once the compaction had completed on all nodes, we
 decommissioned the two temporary nodes.
 
 All nodes now have a high number of pending tasks which isn't dropping
 - they're remaining approximately static. There are constantly
 compaction tasks running, but when they complete, the pending tasks
 number doesn't drop. We've set the compaction rate limit to 0, and
 increased the number of compactor threads until the I/O utilisation is
 at maximum, but neither of these have helped.
 
 Any suggestions?
 
 Cheers,
 
 -- 
 Russ Garrett
 r...@garrett.co.uk
 



Re: DELETE does not delete :)

2013-10-17 Thread Michael Theroux
A couple questions:

1) How did you determine that the record is deleted on only one node? Are you 
looking for tombstones, or the original entry that was inserted? Note that when 
an item is deleted, the original entry can still be in an SSTABLE somewhere, 
and the tombstone can be in another SSTABLE until those tables are compacted 
together.

2) When you did the global daily check, are you sure you are not getting range 
ghosts? I assume they are still possible on 2.0 
(http://www.datastax.com/docs/0.7/getting_started/using_cli, search for range 
ghosts).
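
For question 1, one direct way to check is to dump the row from every SSTable of the column family with sstable2json (it ships with Cassandra; keyspace, column family and key below are placeholders, and the key may need to be hex-encoded depending on the key validator). Deleted cells are flagged in the JSON output, so you can see whether a given file holds the original column, the tombstone, or both:

for f in /var/lib/cassandra/data/<keyspace>/<columnfamily>/*-Data.db; do
    echo "== $f"
    sstable2json "$f" -k <row_key>
done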

Thanks,
-Mike




On Thursday, October 17, 2013 6:36 AM, Alexander Shutyaev shuty...@gmail.com 
wrote:
 
Hi Daniel, Nate.

Thanks for your answers. We have gc_grace_seconds=864000 (which is the default, 
I believe). We've also checked the clocks - they are synchronized.



2013/10/16 Nate McCall n...@thelastpickle.com

This is almost a guaranteed sign that the clocks are off in your cluster. If 
you run the select query a couple of times in a row right after deletion, do 
you see the data appear again?



On Wed, Oct 16, 2013 at 12:12 AM, Alexander Shutyaev shuty...@gmail.com 
wrote:

Hi all,


Unfortunately, we still have a problem. I've modified my code, so that it 
explicitly sets the consistency level to QUORUM for each query. However, we 
found out a few cases where the record is deleted on only 1 node of 3. In these 
cases the delete query executed ok, and the select query that we do right 
after delete returned 0 rows. Later when we ran a global daily check select 
returned 1 row. How can that be? What can we be missing?



2013/10/7 Jon Haddad j...@jonhaddad.com

I haven't used VMWare but it seems odd that it would lock up the ntp port.  
try ps aux | grep ntp to see if ntpd is already running.


On Oct 7, 2013, at 12:23 AM, Alexander Shutyaev shuty...@gmail.com wrote:

Hi Michał,


I didn't notice your message at first.. Well this seems like a real cause 
candidate.. I'll add an explicit consistency level QUORUM and see if that 
helps. Thanks



2013/10/7 Alexander Shutyaev shuty...@gmail.com

Hi Nick,

Thanks for the note! We have our cassandra instances installed on virtual 
hosts in VMWare and the clock synchronization is handled by the latter, so 
I can't use ntpdate (says that NTP socket is in use). Is there any way to 
check if the clocks are really synchronized? My best attempt was using 
three shell windows with commands already typed thus requiring only 
clicking on the window and hitting enter. The results varied by 100-200 
msec which I guess is just about the time I need to click and press enter 
:)


Thanks in advance,
Alexander



2013/10/7 Nikolay Mihaylov n...@nmmm.nu

Hi


my two cents - before doing anything else, make sure clocks are 
synchronized to the millisecond.
ntp will do so.


Nick.



On Mon, Oct 7, 2013 at 9:02 AM, Alexander Shutyaev shuty...@gmail.com 
wrote:

Hi all,


We have encountered the following problem with cassandra.


* We use cassandra v2.0.0 from Datastax community repo.


* We have 3 nodes in a cluster, all of them are seed providers.


* We have a single keyspace with replication factor = 3:


CREATE KEYSPACE bof WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
};


* We use Datastax Java CQL Driver v1.0.3 in our application.


* We have not modified any consistency settings in our app, so I assume 
we have the default QUORUM (2 out of 3 in our case) consistency for 
reads and writes.


* We have 400+ tables which can be divided in two groups (main and 
uids). All tables in a group have the same definition, they vary only by 
name. The sample definitions are:


CREATE TABLE bookingfile (
  key text,
  entity_created timestamp,
  entity_createdby text,
  entity_entitytype text,
  entity_modified timestamp,
  entity_modifiedby text,
  entity_status text,
  entity_uid text,
  entity_updatepolicy text,
  version_created timestamp,
  version_createdby text,
  version_data blob,
  version_dataformat text,
  version_datasource text,
  version_modified timestamp,
  version_modifiedby text,
  version_uid text,
  version_versionnotes text,
  version_versionnumber int,
  versionscount int,
  PRIMARY KEY (key)
) WITH
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};


CREATE TABLE bookingfile_uids (
  date text,
  timeanduid text,
  deleted boolean,
  PRIMARY KEY (date, timeanduid)
) WITH
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  

Reverse compaction on 1.1.11?

2013-09-19 Thread Michael Theroux
Hello,

Quick question.  Is there a tool that allows sstablesplit (reverse compaction) 
against 1.1.11 sstables?  I seem to recall a separate utility somewhere, but 
I'm having difficulty locating it,

Thanks,
-Mike

Issue with leveled compaction and data migration

2013-09-13 Thread Michael Theroux
Hello,

We've been undergoing a migration on Cassandra 1.1.9 where we are combining two 
column families.  We are incrementally moving data from one column family into 
another, where the columns in a row in the source column family are being 
appended to columns in a row in the target column family.  Both column families 
are using leveled compaction, and both column families have over 100 million 
rows.  

However, our bloom filters on the target column family grow dramatically (less 
than double) after converting less than 1/4 of the data.  I assume this is 
because new changes are not being compacted with older changes, although I 
thought leveled compaction would mitigate this for me. Any advice on what we 
can do to control our bloom filter growth during this migration?

Appreciate the help,
Thanks,
-Mike

Temporarily slow nodes on Cassandra

2013-09-02 Thread Michael Theroux
Hello,

We are experiencing an issue where nodes are temporarily slow due to I/O 
contention, anywhere from 10 minutes to 2 hours.  I don't believe this slowdown 
is Cassandra related, but factors outside of Cassandra.  We run Cassandra 
1.1.9.  We run a 12 node cluster, with a replication factor of 3, and all 
queries use LOCAL_QUORUM consistency.

Our problem is (other than the contention issue, which we are working on), when 
this one node slows down, the whole system performance appears to slow down.  
Is there a way in Cassandra to accommodate or mitigate slower nodes?  Shutting 
down the node in question during the period of contention does resolve the 
performance problem, but is there anything in cassandra that can assist this 
situation while we resolve the hardware problem?

Thanks,
-Mike

TTL, Tombstones, and gc_grace

2013-07-25 Thread Michael Theroux
Hello,

Quick question on Cassandra, TTLs, tombstones, and GC grace.  If we have a 
column family whose only mechanism of deleting columns is utilizing TTLs, is 
repair really necessary to make tombstones consistent, and therefore would it 
be safe to set the gc grace period of the column family to a very low value?

I ask because of this blog post based on Cassandra 0.7: 
http://www.datastax.com/dev/blog/whats-new-cassandra-07-expiring-columns.

The first time the expired column is compacted, it is transformed into a 
tombstone. This transformation frees some disk space: the size of the value of 
the expired column. From that moment on, the column is a normal tombstone and 
follows the tombstone rules: it will be totally removed by compaction 
(including minor ones in most cases since Cassandra 0.6.6) after 
GCGraceSeconds.

Since tombstones are not written using a replicated write, but instead written 
during compaction, theoretically, it shouldn't be possible to lose a tombstone? 
 Or is this blog post inaccurate for later versions of cassandra?  We are using 
cassandra 1.1.11.
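
For concreteness, the write pattern in question is just the following (table and column names are illustrative):

INSERT INTO events (key, payload) VALUES ('k1', 'some value') USING TTL 86400;

i.e. columns expire on their own and no explicit DELETE (and hence no client-issued tombstone write) ever happens for this column family.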

Thanks,
-Mike




Re: Deletion use more space.

2013-07-16 Thread Michael Theroux
The only time information is removed from the filesystem is during compaction.  
Compaction can remove tombstones after gc_grace_seconds, which could result in 
reanimation of deleted data if the tombstone was never properly replicated to 
other replicas.  Repair will make sure tombstones are consistent amongst 
replicas.  However, tombstones can not be removed if the data the tombstone is 
deleting is in another SSTable and has not yet been removed. 
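
For reference, the grace period is a per-column-family setting; using the column family name from the quoted thread below, it can be changed either way (the cassandra-cli attribute is gc_grace, CQL3 uses gc_grace_seconds, assuming the table is reachable from cqlsh):

[default@WebSearch] update column family ScheduleInfoCF with gc_grace = 3600;
cqlsh> ALTER TABLE ScheduleInfoCF WITH gc_grace_seconds = 3600;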

Hope this helps,
-Mike

  
On Jul 16, 2013, at 10:04 AM, Andrew Bialecki wrote:

 I don't think setting gc_grace_seconds to an hour is going to do what you'd 
 expect. After gc_grace_seconds, if you haven't run a repair within that hour, 
 the data you deleted will seem to have been undeleted.
 
 Someone correct me if I'm wrong, but in order to completely delete 
 data and regain the space it takes up, you need to delete it, which creates 
 tombstones, and then run a repair on that column family within 
 gc_grace_seconds. After that the data is actually gone and the space 
 reclaimed.
 
 
 On Tue, Jul 16, 2013 at 6:20 AM, 杨辉强 huiqiangy...@yunrang.com wrote:
 Thank you!
 It should be update column family ScheduleInfoCF with gc_grace = 3600;
 Faint.
 
 - Original Message -
 From: 杨辉强 huiqiangy...@yunrang.com
 To: user@cassandra.apache.org
 Sent: Tuesday, July 16, 2013, 6:15:12 PM
 Subject: Re: Deletion use more space.
 
 Hi,
  I use the following cmd to update gc_grace_seconds. It reports an error! Why?
 
 [default@WebSearch] update column family ScheduleInfoCF with gc_grace_seconds 
 = 3600;
 java.lang.IllegalArgumentException: No enum const class 
 org.apache.cassandra.cli.CliClient$ColumnFamilyArgument.GC_GRACE_SECONDS
 
 
 - Original Message -
 From: Michał Michalski mich...@opera.com
 To: user@cassandra.apache.org
 Sent: Tuesday, July 16, 2013, 5:51:49 PM
 Subject: Re: Deletion use more space.
 
 Deletion is not really removing data, but it's adding tombstones
 (markers) of deletion. They'll be later merged with existing data during
 compaction and - in the end (see: gc_grace_seconds) - removed, but by
 this time they'll take some space.
 
 http://wiki.apache.org/cassandra/DistributedDeletes
 
 M.
 
 On 16.07.2013 11:46, 杨辉强 wrote:
  Hi, all:
 I use cassandra 1.2.4, and I have a 4-node ring using the byte order 
  partitioner.
 I had inserted about 200G of data into the ring over the previous days.
 
 Today I wrote a program to scan the ring and, at the same time, delete 
  the items that are scanned.
 To my surprise, cassandra now uses more disk space.
 
  Can anybody tell me why? Thanks.
 
 



Re: Alternate major compaction

2013-07-11 Thread Michael Theroux
Information is only deleted from Cassandra during a compaction.  Using 
SizeTieredCompaction, compaction only occurs when a number of similarly sized 
sstables are combined into a new sstable.  

When you perform a major compaction, all sstables are combined into one very 
large sstable.  As a result, any tombstoned data in that large sstable will 
only be removed when a number of similarly large sstables exist.  This means 
tombstoned data may be trapped in that sstable for a very long time (or 
indefinitely, depending on your use case).
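
A toy sketch of the size-tiered selection that causes this (Python, simplified bucketing; the real implementation's thresholds and ratios differ):

def buckets_ready_to_compact(sstable_sizes_mb, min_threshold=4, ratio=1.5):
    # group SSTables of similar size; only a full bucket gets compacted
    buckets = []
    for size in sorted(sstable_sizes_mb):
        for b in buckets:
            if size <= b[0] * ratio:      # "similar enough" to the bucket's smallest member
                b.append(size)
                break
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]

# The huge table left by a major compaction sits alone in its own bucket, so the
# tombstones inside it wait until several more similarly huge tables exist:
print(buckets_ready_to_compact([200000, 5, 5, 6, 7]))   # -> [[5, 5, 6, 7]]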

-Mike

On Jul 11, 2013, at 9:31 AM, Brian Tarbox wrote:

 Perhaps I should already know this but why is running a major compaction 
 considered so bad?  We're running 1.1.6.
 
 Thanks.
 
 
 On Thu, Jul 11, 2013 at 7:51 AM, Takenori Sato ts...@cloudian.com wrote:
 Hi,
 
 I think it is a common headache for users running a large Cassandra cluster 
 in production.
 
 
 Running a major compaction is not the only cause, but more. For example, I 
 see two typical scenario.
 
 1. backup use case
 2. active wide row
 
 In the case of 1, say, one data is removed a year later. This means, 
 tombstone on the row is 1 year away from the original row. To remove an 
 expired row entirely, a compaction set has to include all the rows. So, when 
 do the original, 1 year old row, and the tombstoned row are included in a 
 compaction set? It is likely to take one year.
 
 In the case of 2, such an active wide row exists in most of sstable files. 
 And it typically contains many expired columns. But none of them wouldn't be 
 removed entirely because a compaction set practically do not include all the 
 row fragments.
 
 
 Btw, there is a very convenient MBean API available. It is 
 CompactionManager's forceUserDefinedCompaction. You can invoke a minor 
 compaction on a file set you define. So the question is how to find an 
 optimal set of sstable files.
 
 Then, I wrote a tool to check garbage, and it prints out some useful information 
 to find such an optimal set.
 
 Here's a simple log output.
 
 # /opt/cassandra/bin/checksstablegarbage -e 
 /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
 [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 
 300(1373504071)]
 ===
 ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, 
 REMAINNING_SSTABLE_FILES
 ===
 hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
 ---
 TOTAL, 40, 40
 ===
 REMAINNING_SSTABLE_FILES means any other sstable files that contain the 
 respective row. So, the following is an optimal set.
 
 # /opt/cassandra/bin/checksstablegarbage -e 
 /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db 
 /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db 
 [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 
 300(1373504131)]
 ===
 ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, 
 REMAINNING_SSTABLE_FILES
 ===
 hello5/100.txt.1373502926003, 223, 0, YES, YES
 ---
 TOTAL, 223, 0
 ===
 This tool relies on SSTableReader and an aggregation iterator as Cassandra 
 does in compaction. I was considering to share this with the community. So 
 let me know if anyone is interested.
 
 Ah, note that it is based on 1.0.7. So I will need to check and update for 
 newer versions.
 
 Thanks,
 Takenori
 
 
 On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez tomas.nu...@groupalia.com 
 wrote:
 Hi
 
 About a year ago, we did a major compaction in our cassandra cluster (a n00b 
 mistake, I know), and since then we've had huge sstables that never get 
 compacted, and we were condemned to repeat the major compaction process every 
 once in a while (we are using the SizeTieredCompaction strategy, and we've not 
 yet evaluated LeveledCompaction, because it has its downsides, and we've had 
 no time to test all of them in our environment).
 
 I was trying to find a way to solve this situation (that is, do something 
 like a major compaction that writes small sstables, not huge as major 
 compaction does), and I couldn't find it in the documentation. I tried 
 cleanup and scrub/upgradesstables, but they don't do that (as documentation 
 states). Then I tried deleting all data in a node and then bootstrapping it 
 (or nodetool rebuild-ing it), hoping that this way the sstables would get 
 cleaned from deleted records and updates. But the deleted node just copied 
 the 

Repair of tombstones

2013-05-19 Thread Michael Theroux
There has been a lot of discussion on the list recently concerning issues with 
repair, runtime, etc.

We recently have had issues with this cassandra bug:

https://issues.apache.org/jira/browse/CASSANDRA-4905

Basically, if you do regular staggered repairs, and you have tombstones that 
can be gc_graced, those tombstones may never be cleaned up if those tombstones 
don't get compacted away before the next repair.  This is because these 
tombstones are essentially recopied to other nodes during the next repair.  
This has been fixed in 1.2, however, we aren't ready to make the jump to 1.2 
yet.

Is there a reason why this hasn't been back-ported to 1.1?  Is it a risky 
change? Although not a silver bullet, it seems it may help a lot of people with 
repair issues (certainly seems it would help us),

-Mike



CQL Clarification

2013-04-28 Thread Michael Theroux
Hello,

Just wondering if I can get a quick clarification on some simple CQL.  We 
utilize Thrift CQL Queries to access our cassandra setup.  As clarified in a 
previous question I had, when using CQL and Thrift, timestamps on the cassandra 
column data are assigned by the server, not the client, unless AND TIMESTAMP 
is utilized in the query, for example:

http://www.datastax.com/docs/1.0/references/cql/UPDATE

According to the Datastax documentation, this timestamp should be:

Values serialized with the timestamp type are encoded as 64-bit signed 
integers representing a number of milliseconds since the standard base time 
known as the epoch: January 1 1970 at 00:00:00 GMT.

However, my testing showed that updates didn't work when I used a timestamp of 
this format.  Looking at the Cassandra code, it appears that cassandra will 
assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not 
specified, which would be the number of microseconds since the standard base 
time.  In my test environment, setting the timestamp to be the current time * 1000 
seems to work.  It seems that if you have an older installation without 
TIMESTAMP being specified in the CQL, or a mixed environment, the timestamp 
should be * 1000.
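
Concretely, what worked in our tests looked like this (column family, column and key are illustrative; exact quoting varies between CQL 2 and CQL 3):

UPDATE users USING TIMESTAMP 1367107200000000   -- i.e. (milliseconds since epoch) * 1000
   SET email = 'user@example.com'
 WHERE key = 'user123';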

Just making sure I'm reading everything properly... improperly setting the 
timestamp could cause us some serious damage.

Thanks,
-Mike




Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
Hello,

We've done some additional monitoring, and I think we have more information.  
We've been collecting vmstat information every minute, attempting to catch a 
node with issues.

So, it appears, that the cassandra node runs fine.  Then suddenly, without any 
correlation to any event that I can identify, the I/O wait time goes way up, 
and stays up indefinitely.  Even non-cassandra  I/O activities (such as 
snapshots and backups) start causing large I/O Wait times when they typically 
would not.  Previous to an issue, we would typically see I/O wait times 3-4% 
with very few blocked processes on I/O.  Once this issue manifests itself, i/O 
wait times for the same activities jump to 30-40% with many blocked processes.  
The I/O wait times do go back down when there is literally no activity.   

-  Updating the node to the latest Amazon Linux patches and rebooting the 
instance doesn't correct the issue.
-  Backing up the node, and replacing the instance does correct the issue.  I/O 
wait times return to normal.

One relatively recent change we've made is we upgraded to m1.xlarge instances 
which has 4 ephemeral drives available.  We create a logical volume from the 4 
drives with the idea that we should be able to get increased I/O throughput.  
When we ran m1.large instances, we had the same setup, although it was only 
using 2 ephemeral drives.  We chose to use LVM vs. mdadm because we were having 
issues having mdadm create the raid volume reliably on restart (and research 
showed that this was a common problem).  LVM just worked (and had worked for 
months before this upgrade)..

For reference, this is the script we used to create the logical volume:

# Build a 4-way striped LVM volume across the ephemeral drives, raise readahead,
# format it as XFS, and mount it at /data:
vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had 
to replace DB nodes within a single availability zone within us-east.  Other 
availability zones, in the same region, have yet to show an issue.

It looks like I'm going to need to replace a third DB node today.  Any advice 
would be appreciated.

Thanks,
-Mike


On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:

 Thanks.
 
 We weren't monitoring this value when the issue occurred, and this particular 
 issue has not appeared for a couple of days (knock on wood).  Will keep an 
 eye out though,
 
 -Mike
 
 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
 
 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at about 
 the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one 
 node, our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would 
 see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 
 



Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
I forgot to mention,

When things go really bad, I'm seeing I/O waits in the 80-95% range.  I 
restarted cassandra once when a node is in this situation, and it took 45 
minutes to start (primarily reading SSTables).  Typically, a node would start 
in about 5 minutes.

Thanks,
-Mike
 
On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

 Hello,
 
 We've done some additional monitoring, and I think we have more information.  
 We've been collecting vmstat information every minute, attempting to catch a 
 node with issues.
 
 So, it appears, that the cassandra node runs fine.  Then suddenly, without 
 any correlation to any event that I can identify, the I/O wait time goes way 
 up, and stays up indefinitely.  Even non-cassandra  I/O activities (such as 
 snapshots and backups) start causing large I/O Wait times when they typically 
 would not.  Previous to an issue, we would typically see I/O wait times 3-4% 
 with very few blocked processes on I/O.  Once this issue manifests itself, 
 i/O wait times for the same activities jump to 30-40% with many blocked 
 processes.  The I/O wait times do go back down when there is literally no 
 activity.   
 
 -  Updating the node to the latest Amazon Linux patches and rebooting the 
 instance doesn't correct the issue.
 -  Backing up the node, and replacing the instance does correct the issue.  
 I/O wait times return to normal.
 
 One relatively recent change we've made is we upgraded to m1.xlarge instances 
 which has 4 ephemeral drives available.  We create a logical volume from the 
 4 drives with the idea that we should be able to get increased I/O 
 throughput.  When we ran m1.large instances, we had the same setup, although 
 it was only using 2 ephemeral drives.  We chose to use LVM vs. mdadm because 
 we were having issues having mdadm create the raid volume reliably on restart 
 (and research showed that this was a common problem).  LVM just worked (and 
 had worked for months before this upgrade)..
 
 For reference, this is the script we used to create the logical volume:
 
 vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
 lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
 blockdev --setra 65536 /dev/mnt_vg/mnt_lv
 sleep 2
 mkfs.xfs /dev/mnt_vg/mnt_lv
 sleep 3
 mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
 sleep 3
 
 Another tidbit... thus far (and this may be only a coincidence), we've only 
 had to replace DB nodes within a single availability zone within us-east.  
 Other availability zones, in the same region, have yet to show an issue.
 
 It looks like I'm going to need to replace a third DB node today.  Any advice 
 would be appreciated.
 
 Thanks,
 -Mike
 
 
 On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
 
 Thanks.
 
 We weren't monitoring this value when the issue occurred, and this 
 particular issue has not appeared for a couple of days (knock on wood).  
 Will keep an eye out though,
 
 -Mike
 
 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
 
 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at 
 about the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one 
 node, our cassandra client appears to see this, reporting errors.  We 
 use LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients 
 would see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 
 
 



Re: CQL Clarification

2013-04-28 Thread Michael Theroux
Yes, that does help,

So, in the link I provided:

http://www.datastax.com/docs/1.0/references/cql/UPDATE

It states:

You can specify these options:

Consistency level
Time-to-live (TTL)
Timestamp for the written columns.

Where timestamp is a link to Working with dates and times and mentions the 
64-bit millisecond value.  Is that incorrect?

-Mike

On Apr 28, 2013, at 11:42 AM, Michael Theroux wrote:

 Hello,
 
 Just wondering if I can get a quick clarification on some simple CQL.  We 
 utilize Thrift CQL Queries to access our cassandra setup.  As clarified in a 
 previous question I had, when using CQL and Thrift, timestamps on the 
 cassandra column data is assigned by the server, not the client, unless AND 
 TIMESTAMP is utilized in the query, for example:
 
 http://www.datastax.com/docs/1.0/references/cql/UPDATE
 
 According to the Datastax documentation, this timestamp should be:
 
 Values serialized with the timestamp type are encoded as 64-bit signed 
 integers representing a number of milliseconds since the standard base time 
 known as the epoch: January 1 1970 at 00:00:00 GMT.
 
 However, my testing showed that updates didn't work when I used a timestamp 
 of this format.  Looking at the Cassandra code, it appears that cassandra 
 will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp 
 is not specified, which would be the number of microseconds since the standard 
 base time.  In my test environment, setting the timestamp to be the current 
 time * 1000 seems to work.  It seems that if you have an older installation 
 without TIMESTAMP being specified in the CQL,   or a mixed environment, the 
 timestamp should be * 1000.
 
 Just making sure I'm reading everything properly... improperly setting the 
 timestamp could cause us some serious damage.
 
 Thanks,
 -Mike
 
 



Re: Really odd issue (AWS related?)

2013-04-26 Thread Michael Theroux
Thanks.

We weren't monitoring this value when the issue occurred, and this particular 
issue has not appeared for a couple of days (knock on wood).  Will keep an eye 
out though,

-Mike

On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:

 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at about 
 the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one node, 
 our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would 
 see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 



Re: Really odd issue (AWS related?)

2013-04-25 Thread Michael Theroux
Sorry, Not sure what CPU steal is :)

I have AWS console with detailed monitoring enabled... things seem to track 
close to the minute, so I can see the CPU load go to 0... then jump at about 
the minute Cassandra reports the dropped messages,

-Mike

On Apr 25, 2013, at 9:50 PM, aaron morton wrote:

 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote:
 Another related question.  Once we see messages being dropped on one node, 
 our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see 
 an error?  If only one node reports an error, shouldn't the consistency 
 level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 



Really odd issue (AWS related?)

2013-04-24 Thread Michael Theroux
Hello,

Since Sunday, we've been experiencing a really odd issue in our Cassandra 
cluster.  We recently started receiving errors that messages are being dropped. 
 But here is the odd part...

When looking in the AWS console, instead of seeing statistics being elevated 
during this time, we actually see all statistics suddenly drop right before 
these messages appear.  CPU, I/O, and network go way down.  In fact, in one 
case, they went to 0 for about 5 minutes to the point that other cassandra 
nodes saw this specific node in question as being down.  The messages appear 
right after the node wakes up.

We've had this happen on 3 different nodes on three different days since Sunday.

Other facts:

- We recently upgraded from m1.large to m1.xlarge instances about two weeks ago.
- We are running Cassandra 1.1.9
- We've been doing some memory tuning, although I have seen this happen on 
untuned nodes.

Has anyone seen anything like this before?

Another related question.  Once we see messages being dropped on one node, our 
cassandra client appears to see this, reporting errors.  We use LOCAL_QUORUM 
with a RF of 3 on all queries.  Any idea why clients would see an error?  If 
only one node reports an error, shouldn't the consistency level prevent the 
client from seeing an issue?

Thanks for your help,
-Mike

Re: Advice on memory warning

2013-04-24 Thread Michael Theroux
 [ScheduledTasks:1] 2013-04-23 16:40:30,845 StatusLogger.java (line 112) 
 system.batchlog   0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,845 StatusLogger.java (line 112) 
 system.NodeIdInfo 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) 
 system.LocationInfo   0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) 
 system.Schema 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) 
 system.Migrations 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) 
 system.schema_keyspaces   0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) 
 system.schema_columns 0,0
 INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) 
 system.schema_columnfamilies 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) 
 system.IndexInfo  0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) 
 system.range_xfers0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) 
 system.peer_events0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) 
 system.hints  0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) 
 system.HintsColumnFamily  0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) 
 x.foo 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) 
 x.foo2 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) 
 x.foo3 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) 
 x.foo4 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) 
 x.foo5  0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) 
 x.foo6 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) 
 x.foo7 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) 
 system_auth.users 0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) 
 system_traces.sessions0,0
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) 
 system_traces.events  0,0
  WARN [ScheduledTasks:1] 2013-04-23 16:40:30,850 GCInspector.java (line 142) 
 Heap is 0.824762725573964 full.  You may need to reduce memtable and/or cache 
 sizes.  Cassandra will now flush up to the two largest memtables to free up 
 memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you 
 don't want Cassandra to do this automatically
  INFO [ScheduledTasks:1] 2013-04-23 16:40:30,850 StorageService.java (line 
 3537) Unable to reduce heap usage since there are no dirty column families
 
 
 
 
 On 23 April 2013 16:52, Ralph Goers ralph.go...@dslextreme.com wrote:
 We are using DSE, which I believe is also 1.1.9.  We have basically had a 
 non-usable cluster for months due to this error.  In our case, once it starts 
 doing this it starts flushing sstables to disk and eventually fills up the 
 disk to the point where it can't compact.  If we catch it soon enough and 
 restart the node it usually can recover.
 
 In our case, the heap size is 12 GB. As I understand it Cassandra will give 
 1/3 of that for sstables. I then noticed that we have one column family that 
 is using nearly 4GB in bloom filters on each node.  Since the nodes will 
 start doing this when the heap reaches 9GB we essentially only have 1GB of 
 free memory so when compactions, cleanups, etc take place this situation 
 starts happening.  We are working to change our data model to try to resolve 
 this.
 
 Ralph
 
 On Apr 19, 2013, at 8:00 AM, Michael Theroux wrote:
 
  Hello,
 
  We've recently upgraded from m1.large to m1.xlarge instances on AWS to 
  handle additional load, but to also relieve memory pressure.  It appears to 
  have accomplished both, however, we are still getting a warning, 0-3 times 
  a day, on our database nodes:
 
  WARN [ScheduledTasks:1] 2013-04-19 14:17:46,532 GCInspector.java (line 145) 
  Heap is 0.7529240824406468 full.  You may need to reduce memtable and/or 
  cache sizes.  Cassandra will now flush up to the two largest memtables to 
  free up memory.  Adjust flush_largest_memtables_at threshold in 
  cassandra.yaml if you don't want Cassandra to do this automatically
 
  This is happening much less frequently than before the upgrade, but after 
  essentially

Re: Moving cluster

2013-04-21 Thread Michael Theroux
I believe the two solutions that are being referred to is the lift and shift 
vs. upgrading by replacing a node and letting it restore from the cluster.

I don't think there are any more risks per se with upgrading by replacing, 
as long as you can make sure your new node is configured properly.  One might 
choose to do lift-and-shift in order to have a node down for less time 
(depending on your individual situation), or to have less of an impact on the 
cluster, as replacing a node would result in other nodes streaming their data 
to the newly replaced node.  Depending on your dataset, this could take quite 
some time.

All this also assumes, of course, that you are replicating your data such that 
the new node can retrieve the information it is responsible for from the other 
nodes.

Thanks,
-Mike


On Apr 21, 2013, at 4:18 PM, aaron morton wrote:

 Sorry i do not understand you question. What are the two solutions ? 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 20/04/2013, at 3:43 AM, Kais Ahmed k...@neteck-fr.com wrote:
 
 Hello and thank you for your answers.
 
 The first solution is much easier for me because I use vnodes.
 
 What is the risk of the first solution?
 
 thank you,
 
 
 2013/4/18 aaron morton aa...@thelastpickle.com
 This is roughly the lift and shift process I use. 
 
 Note that disabling thrift and gossip does not stop an existing repair 
 session. So I often drain and then shutdown, and copy the live data dir 
 rather than a snapshot dir. 
 
 Cheers
  
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 19/04/2013, at 4:10 AM, Michael Theroux mthero...@yahoo.com wrote:
 
 This should work.  
 
 Another option is to follow a process similar to what we recently did.  We 
 recently and successfully upgraded 12 instances from large to xlarge 
 instances in AWS.  I chose not to replace nodes as restoring data from the 
 ring would have taken significant time and put the cluster under some 
 additional load.  I also wanted to eliminate the possibility that any 
 issues on the new nodes could be blamed on new configuration/operating 
 system differences.  Instead we followed the following procedure (removing 
 some details that would likely be unique to our infrastructure).
 
 For a node being upgraded:
 
 1) nodetool disable thrift 
 2) nodetool disable gossip
 3) Snapshot the data (nodetool snapshot ...)
 4) Backup the snapshot data to EBS (assuming you are on ephemeral)
 5) Stop cassandra
 6) Move the cassandra.yaml configuration file to cassandra.yaml.bak (to 
 prevent any restart of the instance from inadvertently starting Cassandra)
 7) Shutdown the instance
 8) Take an AMI of the instance
 9) Start a new instance from the AMI with the desired hardware
 10) If you assign the new instance a new IP Address, make sure any entries 
 in /etc/hosts, or the broadcast_address in cassandra.yaml is updated
 11) Attach the volume you backed up your snapshot data to to the new 
 instance and mount it
 12) Restore the snapshot data
 13) Restore the cassandra.yaml file
 14) Restart cassandra
 
 - I recommend practicing this on a test cluster first
 - As you replace nodes with new IP Addresses, eventually all your seeds 
 will need to be updated.  This is not a big deal until all your seed nodes 
 have been replaced.
 - Don't forget about NTP!  Make sure it is running on all your new nodes.  
 Myself, to be extra careful, I actually deleted the ntp drift file and let 
 NTP recalculate it because it's a new instance, and it took over an hour to 
 restore our snapshot data... but that may have been overkill.
 - If you have the opportunity, depending on your situation, increase the 
 max_hint_window_in_ms
 - Your details may vary
 
 Thanks,
 -Mike
 
 On Apr 18, 2013, at 11:07 AM, Alain RODRIGUEZ wrote:
 
 I would say add your 3 servers to the 3 tokens where you want them, let's 
 say :
 
 {
 0: {
 0: 0,
 1: 56713727820156410577229101238628035242,
 2: 113427455640312821154458202477256070485
 }
 }
 
 or these token -1 or +1 if you already have these token used. And then 
 just decommission x1Large nodes. You should be good to go.
 
 
 
 2013/4/18 Kais Ahmed k...@neteck-fr.com
 Hi,
 
 What is the best practice to move from a cluster of 7 nodes (m1.xlarge) to 
 3 nodes (hi1.4xlarge).
 
 Thanks,
 
 
 
 
 



Advice on memory warning

2013-04-19 Thread Michael Theroux
Hello,

We've recently upgraded from m1.large to m1.xlarge instances on AWS to handle 
additional load, but to also relieve memory pressure.  It appears to have 
accomplished both, however, we are still getting a warning, 0-3 times a day, on 
our database nodes:

WARN [ScheduledTasks:1] 2013-04-19 14:17:46,532 GCInspector.java (line 145) 
Heap is 0.7529240824406468 full.  You may need to reduce memtable and/or cache 
sizes.  Cassandra will now flush up to the two largest memtables to free up 
memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you 
don't want Cassandra to do this automatically

This is happening much less frequently than before the upgrade, but after 
essentially doubling the amount of available memory, I'm curious on what I can 
do to determine what is happening during this time.  

I am collecting all the JMX statistics.  Memtable space is elevated but not 
extraordinarily high.  No GC messages are being output to the log.   

These warnings do seem to be occurring doing compactions of column families 
using LCS with wide rows, but I'm not sure there is a direct correlation.

We are running Cassandra 1.1.9, with a maximum heap of 8G.  
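For context, the thresholds that warning refers to live in cassandra.yaml; a sketch with what I believe are the 1.1 defaults (verify against your own yaml before changing anything):

    # fraction of the heap at which Cassandra flushes the largest memtables
    flush_largest_memtables_at: 0.75
    # fractions at which cache sizes are reduced / capacity is cut back
    reduce_cache_sizes_at: 0.85
    reduce_cache_capacity_to: 0.6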

Any advice?
Thanks,
-Mike

Re: CQL

2013-04-19 Thread Michael Theroux
A lot more details on your usecase and requirements would help.  You need to 
make specific considerations in cassandra when you have requirements around 
ordering.  Ordering can be achieved across columns.  Ordering across rows is a 
bit more tricky and may require the use of specific partitioners...

I did a real quick google search on cassandra ordering and found some good 
links (http://ayogo.com/blog/sorting-in-cassandra/).

To do queries with ordering across columns, using the FIRST keyword will 
return results based on the comparator you defined in your schema:

http://cassandra.apache.org/doc/cql/CQL.html#SELECT

Hope this helps,
-Mike

On Apr 19, 2013, at 7:32 AM, Sri Ramya wrote:

 hi,
 
  I am working with CQL. I want to perform a query based on timestamp. Can anyone 
 help me with how to select dates greater than or less than a given timestamp 
 in Cassandra.



Re: Moving cluster

2013-04-18 Thread Michael Theroux
This should work.  

Another option is to follow a process similar to what we recently did.  We 
recently and successfully upgraded 12 instances from large to xlarge instances 
in AWS.  I chose not to replace nodes as restoring data from the ring would 
have taken significant time and put the cluster under some additional load.  I 
also wanted to eliminate the possibility that any issues on the new nodes could 
be blamed on new configuration/operating system differences.  Instead we 
followed the following procedure (removing some details that would likely be 
unique to our infrastructure).

For a node being upgraded:

1) nodetool disable thrift 
2) nodetool disable gossip
3) Snapshot the data (nodetool snapshot ...)
4) Backup the snapshot data to EBS (assuming you are on ephemeral)
5) Stop cassandra
6) Move the cassandra.yaml configuration file to cassandra.yaml.bak (to prevent 
any restart of the instance from inadvertently starting Cassandra)
7) Shutdown the instance
8) Take an AMI of the instance
9) Start a new instance from the AMI with the desired hardware
10) If you assign the new instance a new IP Address, make sure any entries in 
/etc/hosts, or the broadcast_address in cassandra.yaml is updated
11) Attach the volume you backed up your snapshot data to to the new instance 
and mount it
12) Restore the snapshot data
13) Restore the cassandra.yaml file
14) Restart cassandra
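As a condensed shell sketch of steps 1-5 (the actual nodetool subcommands are disablethrift and disablegossip; the snapshot tag, mount point, and service command are placeholders for whatever fits your setup):

    nodetool -h localhost disablethrift
    nodetool -h localhost disablegossip
    nodetool -h localhost snapshot -t pre-resize
    # copy the snapshot (or live data dir) to the backup EBS volume
    rsync -a /var/lib/cassandra/data/ /mnt/backup/data/
    sudo service cassandra stop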

- I recommend practicing this on a test cluster first
- As you replace nodes with new IP Addresses, eventually all your seeds will 
need to be updated.  This is not a big deal until all your seed nodes have been 
replaced.
- Don't forget about NTP!  Make sure it is running on all your new nodes.  
Myself, to be extra careful, I actually deleted the ntp drift file and let NTP 
recalculate it because it's a new instance, and it took over an hour to restore 
our snapshot data... but that may have been overkill.
- If you have the opportunity, depending on your situation, increase the 
max_hint_window_in_ms
- Your details may vary

Thanks,
-Mike

On Apr 18, 2013, at 11:07 AM, Alain RODRIGUEZ wrote:

 I would say add your 3 servers to the 3 tokens where you want them, let's say 
 :
 
 {
 0: {
 0: 0,
 1: 56713727820156410577229101238628035242,
 2: 113427455640312821154458202477256070485
 }
 }
 
 or these token -1 or +1 if you already have these token used. And then just 
 decommission x1Large nodes. You should be good to go.
 
 
 
 2013/4/18 Kais Ahmed k...@neteck-fr.com
 Hi,
 
 What is the best practice to move from a cluster of 7 nodes (m1.xlarge) to 3 
 nodes (hi1.4xlarge).
 
 Thanks,
 



Timestamps and CQL

2013-04-12 Thread Michael Theroux
Hello,

We are having an odd sporadic issue that I believe may be due to time 
synchronization.  Without going into details on the issue right now, a quick 
question: the documentation makes numerous references to Cassandra using 
timestamps generated by the clients to determine write serialization.  
However, going through the code, it appears that for CQL over thrift, it will 
only use client generated timestamps if USING TIMESTAMP is utilized in the 
cql statement, otherwise it will use a server-generated timestamp.

I was looking through 1.1.2 code.

Am I reading this correctly?

-Mike

Re: 13k pending compaction tasks but ZERO running?

2013-03-14 Thread Michael Theroux
Hi Dean,

I saw the same behavior when we switched from STCS to LCS on a couple of our 
tables.  Not sure why it doesn't proceed immediately (I pinged the list, but 
didn't get any feedback).  However, running nodetool compact <keyspace> <table> 
got things moving for me.

-Mike

On Mar 14, 2013, at 10:44 AM, Hiller, Dean wrote:

 How do I get my node to run through the 13k pending compaction tasks?  I had 
 to use iptables to take the ring out of the cluster for now and he is my only 
 node still on STCS.  In cassandra-cli, it shows LCS but on disk, I see a 
 36Gig file(ie. Must be STCS still).  How can I get the 13k pending tasks to 
 start running?
 
 Nodetool compactionstats ….
 pending tasks: 13793
 Active compaction remaining time :n/a
 
 Thanks,
 Dean



Re: 13k pending compaction tasks but ZERO running?

2013-03-14 Thread Michael Theroux
One more warning (which I'm sure you know, but in case others see this), 
nodetool compact does a major compaction for STCS, and is in general not 
recommended for STCS.  I only ran it on the tables we've converted to LCS.

-Mike

On Mar 14, 2013, at 11:26 AM, Michael Theroux wrote:

 Hi Dean,
 
 I saw the same behavior when we switched from STCS to LCS on a couple of our 
 tables.  Not sure why it doesn't proceed immediately (I pinged the list, but 
 didn't get any feedback).  However, running nodetool compact <keyspace> 
 <table> got things moving for me.
 
 -Mike
 
 On Mar 14, 2013, at 10:44 AM, Hiller, Dean wrote:
 
 How do I get my node to run through the 13k pending compaction tasks?  I had 
 to use iptables to take the ring out of the cluster for now and he is my 
 only node still on STCS.  In cassandra-cli, it shows LCS but on disk, I see 
 a 36Gig file(ie. Must be STCS still).  How can I get the 13k pending tasks 
 to start running?
 
 Nodetool compactionstats ….
 pending tasks: 13793
 Active compaction remaining time :n/a
 
 Thanks,
 Dean
 



Re: About the heap

2013-03-14 Thread Michael Theroux
Hi Aaron,

If you have the chance, could you expand on why m1.xlarge is the much better 
choice?  We will soon need to choose between expanding from a 12-node to a 
24-node cluster using m1.large instances, and upgrading all instances to 
m1.xlarge, so the justifications would be helpful (although Aaron says so 
does help ;) ).  

One obvious reason is administrating a 24 node cluster does add person-time 
overhead.  

Another reason includes less impact of maintenance activities such as repair, 
as these activities have significant CPU overhead.  Doubling the cluster size 
would, in theory, halve the time for this overhead, but would still impact 
performance during that time.  Going to xlarge would lessen the impact of these 
activities on operations.

Anything else?

Thanks,

-Mike

On Mar 14, 2013, at 9:27 AM, aaron morton wrote:

 Because of this I have an unstable cluster and have no other choice than to use 
 Amazon EC2 xLarge instances when we would rather use twice as many EC2 Large 
 nodes.
 m1.xlarge is a MUCH better choice than m1.large.
 You get more ram and better IO and less steal. Using half as many m1.xlarge 
 is the way to go. 
 
 My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to 
 the max 8 GB (crashing the node).
 How is it crashing ?
 Are you getting too much GC or running OOM ? 
 Are you using the default GC configuration ?
 Is cassandra logging a lot of GC warnings ?
 
 If you are running OOM then something has to change. Maybe bloom filters, 
 maybe caches.
 
 Enable the GC logging in cassandra-env.sh to check how low a CMS compaction 
 gets the heap, or use some other tool. That will give an idea of how much 
 memory you are using. 
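 The relevant JVM options ship commented out in cassandra-env.sh; uncommenting 
 something like the following is enough (the log path is whatever suits you):
 
     JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
     JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
     JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
     JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
     JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"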
 
 Here is some background on what is kept on heap in pre 1.2
 http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 13/03/2013, at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote:
 
 Here is the JIRA I submitted regarding the ancestor.
 
 https://issues.apache.org/jira/browse/CASSANDRA-5342
 
 -Wei
 
 
 - Original Message -
 From: Wei Zhu wz1...@yahoo.com
 To: user@cassandra.apache.org
 Sent: Wednesday, March 13, 2013 11:35:29 AM
 Subject: Re: About the heap
 
 Hi Dean,
 The index_interval is controlling the sampling of the SSTable to speed up 
 the lookup of the keys in the SSTable. Here is the code:
 
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478
 
 Increasing the interval means taking fewer samples: less memory, but slower 
 key lookups on reads.
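 Concretely, that is the index_interval knob in cassandra.yaml (128 is the 
 default; changing it requires a restart to take effect):
 
     # sample one row key per N index entries; larger = less heap, slower key lookups
     index_interval: 512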
 
 I did do a heap dump on my production system which caused about 10 seconds 
 pause of the node. I found something interesting, for LCS, it could involve 
 thousands of SSTables for one compaction, the ancestors are recorded in case 
 something goes wrong during the compaction. But those are never removed 
 after the compaction is done. In our case, it takes about 1G of heap memory 
 to store that. I am going to submit a JIRA for that. 
 
 Here is the culprit:
 
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58
 
 Enjoy looking at Cassandra code:)
 
 -Wei
 
 
 - Original Message -
 From: Dean Hiller dean.hil...@nrel.gov
 To: user@cassandra.apache.org
 Sent: Wednesday, March 13, 2013 11:11:14 AM
 Subject: Re: About the heap
 
 Going to 1.2.2 helped us quite a bit as well as turning on LCS from STCS 
 which gave us smaller bloomfilters.
 
 As far as the key cache goes, there is an entry in cassandra.yaml called 
 index_interval, set to 128.  I am not sure whether that is related to the key 
 cache; I think it is.  By turning that up to 512 or maybe even 1024, you will 
 consume less RAM there as well, though I ran this test in QA and my key cache 
 size stayed the same, so I am really not sure (I am actually checking out the 
 Cassandra code now to dig a little deeper into this property).
 
 Dean
 
 From: Alain RODRIGUEZ arodr...@gmail.commailto:arodr...@gmail.com
 Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org 
 user@cassandra.apache.orgmailto:user@cassandra.apache.org
 Date: Wednesday, March 13, 2013 10:11 AM
 To: user@cassandra.apache.orgmailto:user@cassandra.apache.org 
 user@cassandra.apache.orgmailto:user@cassandra.apache.org
 Subject: About the heap
 
 Hi,
 
 I would like to know everything that is in the heap.
 
 We are here speaking of C*1.1.6
 
 Theory :
 
 - Memtable (1024 MB)
 - Key Cache (100 MB)
 - Row Cache (disabled, and serialized with JNA activated anyway, so should 
 be off-heap)
 - BloomFilters (about 1.03 GB - from cfstats, adding all the Bloom Filter 
 Space Used values and considering they are shown in bytes - 1103765112)
 - Anything else ?
 
 So my heap should be fluctuating between 1.15 GB and 2.15 GB and growing 
 slowly (from the new BF of my new data).
 
 My heap is actually changing from 3-4 

Re: Bloom filters and LCS

2013-03-08 Thread Michael Theroux
I think my impression that Bloom Filters were off in 1.1.9 was a 
misinterpretation of this thread:

http://www.mail-archive.com/user@cassandra.apache.org/msg27787.html

and this bug:

https://issues.apache.org/jira/browse/CASSANDRA-5029

 I read it as Bloom filters being added back for LCS in 1.2.2, but 
 apparently after being shut off in an earlier version of 1.2?

-Mike
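If the goal is simply smaller filters, the per-CF bloom_filter_fp_chance setting can also be raised explicitly; a hedged sketch in 1.2-era CQL 3 (table name hypothetical, and existing SSTables only pick it up after they are rewritten):

    ALTER TABLE mytable WITH bloom_filter_fp_chance = 0.1;
    -- then: nodetool upgradesstables <keyspace> mytable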



On Mar 7, 2013, at 4:48 PM, Edward Capriolo wrote:

 I read that the change was made because Cassandra does not work well when 
 they are off. This makes sense because cassandra uses bloom filters to decide 
 if a row can be deleted without major compaction. However since LCS does not 
 major compact without bloom filters you can end up in cases where rows never 
 get deleted.
 
 Edward
 
 On Thu, Mar 7, 2013 at 4:30 PM, Wei Zhu wz1...@yahoo.com wrote:
 Where did you read that bloom filters are off for LCS on 1.1.9?
 
 Those are the two issues I can find regarding this matter:
 
 https://issues.apache.org/jira/browse/CASSANDRA-4876
 https://issues.apache.org/jira/browse/CASSANDRA-5029
 
 Looks like in 1.2, it defaults at 0.1, not sure about 1.1.X
 
 -Wei
 
 - Original Message -
 From: Michael Theroux mthero...@yahoo.com
 To: user@cassandra.apache.org
 Sent: Thursday, March 7, 2013 1:18:38 PM
 Subject: Bloom filters and LCS
 
 Hello,
 
 (Hopefully) Quick question.
 
 We are running Cassandra 1.1.9.
 
 I recently converted some tables from Size tiered to Leveled Compaction.  The 
 amount of space for Bloom Filters on these tables went down tremendously 
 (which is expected, LCS in 1.1.9 does not use bloom filters).
 
 However, although it's far less, it's still using a number of megabytes.  Why 
 is it not zero?
 
 
 Column Family: 
 SSTable count: 526
 Space used (live): 7251063348
 Space used (total): 7251063348
 Number of Keys (estimate): 23895552
 Memtable Columns Count: 45719
 Memtable Data Size: 21207173
 Memtable Switch Count: 579
 Read Count: 21773431
 Read Latency: 4.155 ms.
 Write Count: 16183367
 Write Latency: 0.029 ms.
 Pending Tasks: 0
 Bloom Filter False Positives: 2442
 Bloom Filter False Ratio: 0.00245
 Bloom Filter Space Used: 44674656
 Compacted row minimum size: 73
 Compacted row maximum size: 105778
 Compacted row mean size: 1104
 
 Thanks,
 -Mike
 
 
 
 



Re: Size Tiered - Leveled Compaction

2013-03-08 Thread Michael Theroux
I've asked this myself in the past... fairly arbitrarily chose 10MB based on 
Wei's experience,

-Mike

On Mar 8, 2013, at 1:50 PM, Hiller, Dean wrote:

 +1  (I would love to know this info).
 
 Dean
 
 From: Wei Zhu wz1...@yahoo.commailto:wz1...@yahoo.com
 Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org 
 user@cassandra.apache.orgmailto:user@cassandra.apache.org, Wei Zhu 
 wz1...@yahoo.commailto:wz1...@yahoo.com
 Date: Friday, March 8, 2013 11:11 AM
 To: user@cassandra.apache.orgmailto:user@cassandra.apache.org 
 user@cassandra.apache.orgmailto:user@cassandra.apache.org
 Subject: Re: Size Tiered - Leveled Compaction
 
 I have the same wonder.
 We started with the default 5M, and the compaction after repair took too long 
 on a 200G node, so we increased the size to 10M sort of arbitrarily since there 
 is not much documentation around it. Our tech op team still thinks there are 
 too many files in one directory. To fulfill the guidelines from them (don't 
 remember the exact number, but something in the range of 50K files), we will 
 need to increase the size to around 50M. I think the latency of  opening one 
 file is not impacted much by the number of files in one directory for the 
 modern file system. But ls and other operations suffer.
 
 Anyway, I asked about the side effects of a bigger SSTable in IRC, and someone 
 mentioned that during a read C* reads the whole SSTable from disk in order to 
 access the row, which causes more disk IO compared with a smaller SSTable. I 
 don't know enough about the internals of Cassandra to say whether that is the 
 case or not. If it is the case (question mark), is the SSTable or the row kept 
 in memory? I hope someone can confirm the theory here, or I will have to dig 
 into the source code to find out.
 
 Another concern is repair: does it stream the whole SSTable or only part of 
 it when a mismatch is detected? I have seen claims of both; can someone 
 please confirm that as well?
 
 The last thing is the effectiveness of parallel LCS in 1.2. It takes 
 quite some time for compaction to finish after a repair with LCS on 1.1.X. 
 Both CPU and disk utilization are low during the compaction, which means LCS 
 doesn't fully utilize the available resources.  It will make life easier if 
 the issue is addressed in 1.2.
 
 Bottom line is that there is not much documentation/guideline/successful 
 story around LCS although it sounds beautiful on paper.
 
 Thanks.
 -Wei
 
 From: Alain RODRIGUEZ arodr...@gmail.commailto:arodr...@gmail.com
 To: user@cassandra.apache.orgmailto:user@cassandra.apache.org
 Cc: Wei Zhu wz1...@yahoo.commailto:wz1...@yahoo.com
 Sent: Friday, March 8, 2013 1:25 AM
 Subject: Re: Size Tiered - Leveled Compaction
 
 I'm still wondering about how to chose the size of the sstable under LCS. 
 Defaul is 5MB, people use to configure it to 10MB and now you configure it at 
 128MB. What are the benefits or inconveniants of a very small size (let's say 
 5 MB) vs big size (like 128MB) ?
 
 Alain
 
 
 2013/3/8 Al Tobey a...@ooyala.commailto:a...@ooyala.com
 We saw exactly the same thing as Wei Zhu: 100k+ tables in a directory 
 causing all kinds of issues.  We're running 128MiB ssTables with LCS and have 
 disabled compaction throttling.  128MiB was chosen to get file counts under 
 control and reduce the number of files C* has to manage and search. I just 
 looked and a ~250GiB node is using about 10,000 files, which is quite 
 manageable.  This configuration is running smoothly in production under mixed 
 read/write load.
 
 We're on RAID0 across 6 15k drives per machine. When we migrated data to this 
 cluster we were pushing well over 26k/s+ inserts with CL_QUORUM. With 
 compaction throttling enabled at any rate it just couldn't keep up. With 
 throttling off, it runs smoothly and does not appear to have an impact on our 
 applications, so we always leave it off, even in EC2.  An 8GiB heap is too 
 small for this config on 1.1. YMMV.
 
 -Al Tobey
 
 On Thu, Feb 14, 2013 at 12:51 PM, Wei Zhu 
 wz1...@yahoo.commailto:wz1...@yahoo.com wrote:
 I haven't tried to switch compaction strategy. We started with LCS.
 
 For us, after massive data imports (5000 w/seconds for 6 days), the first 
 repair is painful since there is quite some data inconsistency. For 150G 
 nodes, repair brought in about 30 G and created thousands of pending 
 compactions. It took almost a day to clear those. Just be prepared LCS is 
 really slow in 1.1.X. System performance degrades during that time since 
 reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We 
 tried everything we can and couldn't speed it up. I think it's single 
 threaded and it's not recommended to turn on multithread compaction. We 
 even tried that, it didn't help )There is parallel LCS in 1.2 which is 
 supposed to alleviate the pain. Haven't upgraded yet, hope it works:)
 
 

Bloom filters and LCS

2013-03-07 Thread Michael Theroux
Hello,

(Hopefully) Quick question.

We are running Cassandra 1.1.9.

I recently converted some tables from Size tiered to Leveled Compaction.  The 
amount of space for Bloom Filters on these tables went down tremendously (which 
is expected, LCS in 1.1.9 does not use bloom filters). 

However, although it's far less, it's still using a number of megabytes.  Why is 
it not zero?


Column Family: 
SSTable count: 526
Space used (live): 7251063348
Space used (total): 7251063348
Number of Keys (estimate): 23895552
Memtable Columns Count: 45719
Memtable Data Size: 21207173
Memtable Switch Count: 579
Read Count: 21773431
Read Latency: 4.155 ms.
Write Count: 16183367
Write Latency: 0.029 ms.
Pending Tasks: 0
Bloom Filter False Positives: 2442
Bloom Filter False Ratio: 0.00245
Bloom Filter Space Used: 44674656
Compacted row minimum size: 73
Compacted row maximum size: 105778
Compacted row mean size: 1104

Thanks,
-Mike




Re: -pr vs. no -pr

2013-02-28 Thread Michael Theroux
The way I've always thought about it is that -pr will make sure the information 
that specific node originates is consistent with its replicas.

So, we know that a node is responsible for a specific token range, and the next 
nodes in the ring will hold its replicas.  The -pr will make sure that a 
specific node's information is consistent to its replicas, but will not make 
sure a specific node has all the replicated information it can get from nodes 
previous to itself in the ring.

Without the -pr option, not only will the current node make sure its 
information and its replica's information is consistent, but it will also make 
sure that all the information that it is a replica for, is consistent.  

If you run regular repairs on all the nodes in your cluster, then -pr is 
sufficient.  Every node will run repair, and make sure its information is 
consistent with its replicas, eventually creating a fully consistent cluster.  
This is a quicker process, and will have less impact on your operations by 
essentially spreading out the pain.  

For instance, we run a 12 node cluster.  We run nodetool repair -pr on nodes 
that are opposite to each other, 4 nodes a day (2 nodes in the morning, 2 nodes 
in the evening).  With a grace period of 10 days, this allows us to run repairs 
twice a week on a specific node, and to occasionally skip repairs on specific 
nodes once a week.  
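As an illustration only (the times and log path here are hypothetical), each node's crontab ends up with a staggered entry along these lines:

    # repair this node's primary range twice a week (Mon + Thu at 06:00)
    0 6 * * 1,4  nodetool repair -pr > /var/log/cassandra/repair-pr.log 2>&1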

In this case, without -pr, a lot of extra work would be done.  In fact, with an 
RF of 3 (in our case), the time per repair would increase many fold.

Another way to think about it... although likely not 100% technically correct...

A repair -pr will cause a push of a node's information to its replicas.  
Without the -pr, it will cause a push, and it will cause nodes it is a replica 
for to push their information as well.

-Mike

On Feb 28, 2013, at 9:39 PM, Hiller, Dean wrote:

 Isn't there more to it than that?  You really have nodes responsible for
 token ranges like so (using describe ring).
 
 What we see is this from our describe ring (1 to 6 are token ranges while
 A to F are servers):
 A - 1, 2, 3
 B - 2, 3, 4
 C - 3, 4, 5
 D - 4, 5, 6
 E - 5, 6, 1
 F - 6, 1, 2
 
 With -pr, only token range 1 is repaired I think, right?  2 and 3 are only
 repaired without the -pr option?  This means if I have a node that I just
 joined the cluster, I should not be using -pr as 2 and 3 on node A will
 not be up to date.  Using -pr is nice if I am going to repair every single
 node and is nice for the cron job that has to happen before
 gc_grace_seconds.  Am I wrong here?  Ie. -pr is really only good for use
 in the cron job as it would miss 2 and 3 above.  I could run the cron on
 just two servers but then my nodes are different which can be a hassle.
 
 Please verify that is what you believe is what happens as well?
 
 Thanks,
 Dean
 
 On 2/28/13 5:58 PM, Takenori Sato(Cloudian) ts...@cloudian.com wrote:
 
 Hi,
 
 Please note that I confirmed on v1.0.7.
 
 I mean a repair involves all three nodes and pushes and pulls data,
 right?
 
 Yes, but that's how -pr works. A repair without -pr does more.
 
 For example, suppose you have a ring with RF=3 like this.
 
 A - B - C - D - E - F
 
 Then, a repair on A without -pr does for 3 ranges as follows:
 [A, B, C]
 [E, F, A]
 [F, A, B]
 
 Among them, the first one, [A, B, C] is the primary range of A.
 
 So, with -pr, a repair runs only for:
 [A, B, C]
 
 I could run nodetool repair on just 2 nodes(RF=3) instead of using
 nodetool repair -pr???
 
 Yes.
 
 You need to run two repairs on A and D.
 
 What is the advantage of -pr then?
 
 Whenever you want to minimize repair impacts.
 
 For example, suppose you got one node down for a while, and bring it
 back to the cluster.
 
 You need to run repair without affecting the entire cluster. Then, -pr 
 is the option.
 
 Thanks,
 Takenori
 
 (2013/03/01 7:39), Hiller, Dean wrote:
 Isn't it true if I have 6 nodes, I could run nodetool repair on just 2
 nodes(RF=3) instead of using nodetool repair -pr???
 
 What is the advantage of -pr then?
 
 I mean a repair involves all three nodes and pushes and pulls data,
 right?
 
 Thanks,
 Dean
 
 



Re: Size Tiered - Leveled Compaction

2013-02-14 Thread Michael Theroux
BTW, when I say major compaction, I mean running the nodetool compact 
command (which does a major compaction for Size Tiered Compaction).  I didn't 
see the distribution of SSTables I expected until I ran that command, in the 
steps I described below.  
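To make the sequence concrete (table name hypothetical; the map-style option syntax below is the 1.2-era CQL 3 form, while 1.1 uses the older compaction_strategy_class / compaction_strategy_options attributes):

    -- switch the column family to LCS with a 10 MB SSTable target
    ALTER TABLE mytable WITH compaction = { 'class' : 'LeveledCompactionStrategy',
                                            'sstable_size_in_mb' : 10 };
    -- then, on 1.1.9, kick the re-levelling along:
    -- nodetool compact <keyspace> mytable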

-Mike

On Feb 14, 2013, at 3:51 PM, Wei Zhu wrote:

 I haven't tried to switch compaction strategy. We started with LCS. 
 
 For us, after massive data imports (5000 w/seconds for 6 days), the first 
 repair is painful since there is quite some data inconsistency. For 150G 
 nodes, repair brought in about 30 G and created thousands of pending 
 compactions. It took almost a day to clear those. Just be prepared LCS is 
 really slow in 1.1.X. System performance degrades during that time since 
 reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We 
 tried everything we can and couldn't speed it up. I think it's single 
 threaded and it's not recommended to turn on multithread compaction. We 
 even tried that, it didn't help )There is parallel LCS in 1.2 which is 
 supposed to alleviate the pain. Haven't upgraded yet, hope it works:)
 
 http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2
 
 
 Since our cluster is not write intensive (only 100 w/second), I don't see any 
 pending compactions during regular operation. 
 
 One thing worth mentioning is the size of the SSTable, default is 5M which is 
 kind of small for a 200G (all in one CF) data set, and we are on SSD.  That is 
 more than 150K files in one directory (200G/5M = 40K SSTables, and each SSTable 
 creates 4 files on disk).  You might want to watch that and decide the SSTable 
 size. 
 
 By the way, there is no concept of Major compaction for LCS. Just for fun, 
 you can look at a file called $CFName.json in your data directory and it 
 tells you the SSTable distribution among different levels. 
 
 -Wei
 
 From: Charles Brophy cbro...@zulily.com
 To: user@cassandra.apache.org 
 Sent: Thursday, February 14, 2013 8:29 AM
 Subject: Re: Size Tiered - Leveled Compaction
 
 I second these questions: we've been looking into changing some of our CFs to 
 use leveled compaction as well. If anybody here has the wisdom to answer them 
 it would be of wonderful help.
 
 Thanks
 Charles
 
 On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote:
 Hello,
 
 I'm investigating the transition of some of our column families from Size 
 Tiered to Leveled Compaction.  I believe we have some high-read-load column 
 families that would benefit tremendously.
 
 I've stood up a test DB Node to investigate the transition.  I successfully 
 alter the column family, and I immediately noticed a large number (1000+) 
 pending compaction tasks become available, but no compaction get executed.
 
 I tried running nodetool upgradesstables on the column family, and the 
 compaction tasks don't move.
 
 I also notice no changes to the size and distribution of the existing 
 SSTables.
 
 I then run a major compaction on the column family.  All pending compaction 
 tasks get run, and the SSTables have a distribution that I would expect from 
 LeveledCompaction (lots and lots of 10MB files).
 
 Couple of questions:
 
 1) Is a major compaction required to transition from size-tiered to leveled 
 compaction?
 2) Are major compactions as much of a concern for LeveledCompaction as their 
 are for Size Tiered?
 
 All the documentation I found concerning transitioning from Size Tiered to 
 Leveled compaction discusses the ALTER TABLE CQL command, but I haven't found too 
 much on what else needs to be done after the schema change.
 
 I did these tests with Cassandra 1.1.9.
 
 Thanks,
 -Mike
 
 
 



Read operations resulting in a write?

2012-12-14 Thread Michael Theroux
Hello,

We have an unusual situation that I believe I've reproduced, at least 
temporarily, in a test environment.  I also think I see where this issue is 
occurring in the code.

We have a specific column family that is under heavy read and write load on a 
nightly basis.   For the purposes of this description, I'll refer to this 
column family as Bob.  During this nightly processing, sometimes Bob is under 
very heavy write load, other times very heavy read load.

The application is such that when something is written to Bob, a write is made 
to one of two other tables.  We've witnessed a situation where the write count 
on Bob far outstrips the write count on either of the other tables, by a factor 
of 3-10.  This is based on the WriteCount available on the column family JMX 
MBean.  We have not been able to find where in our code this is happening, and 
we have gone as far as tracing our CQL calls to determine that the relationship 
between Bob and the other tables are what we expect.

I brought up a test node to experiment, and see a situation where, when a 
select statement is executed, a write will occur.

In my test, I perform the following (switching between nodetool and cqlsh):

update bob set 'about'='coworker' where key='hex key';    
nodetool flush
update bob set 'about'='coworker' where key='hex key';    
nodetool flush
update bob set 'about'='coworker' where key='hex key';    
nodetool flush
update bob set 'about'='coworker' where key='hex key';    
nodetool flush
update bob set 'about'='coworker' where key='hex key';    
nodetool flush

Then, for a period of time (before a minor compaction occurs), a select 
statement that selects specific columns will cause the write count of the 
column family to increase:

select about,changed,data from bob where key='hex key';

This situation will continue until a minor compaction is completed.

I went into the code and added some traces to CollationController.java:

private ColumnFamily collectTimeOrderedData()
{
    logger.debug("collectTimeOrderedData");
    ... snip ...

    // ---> HERE
    logger.debug("tables iterated: " + sstablesIterated
                 + "  Min compact: " + cfs.getMinimumCompactionThreshold());

    // hoist up the requested data into a more recent sstable
    if (sstablesIterated > cfs.getMinimumCompactionThreshold()
        && !cfs.isCompactionDisabled()
        && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
    {
        RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
        try
        {
            // ---> HERE
            logger.debug("Apply hoisted up row mutation");
            // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
            Table.open(rm.getTable()).apply(rm, false, false);
        }
        catch (IOException e)
        {
            // log and allow the result to be returned
            logger.error("Error re-writing read results", e);
        }
    }
    ... snip ...

Performing the steps above, I see the following traces (in the test environment 
I decreased the minimum compaction threshold to make this easier to reproduce). 
 After I do a couple of update/flush, I see this in the log:

DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) 
Scheduling a background task check for bob with SizeTieredCompactionStrategy


Then, until compaction occurs, I see (when performing a select):

DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) 
Disseminating load info ...
DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) 
execute_cql_query
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL 
statement type: SELECT
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) 
Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', 
key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', 
columnName='null')', columns=[about,changed,data,])/ONE
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor 
is 1; setting up requests to /10.0.4.20
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading 
data locally
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) 
LocalReadRunnable reading SliceByNamesReadCommand(table='open', 
key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', 
columnName='null')', columns=[about,changed,data,])
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) 
In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter 
type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) 
collectTimeOrderedData
--- DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java 
(line 188) tables iterated: 4 Min compact: 2

 DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 

Cassandra compression not working?

2012-09-24 Thread Michael Theroux
Hello,

We are running into an unusual situation that I'm wondering if anyone has any 
insight on.  We've been running a Cassandra cluster for some time, with 
compression enabled on one column family in which text documents are stored.  
We enabled compression on the column family, utilizing the SnappyCompressor and 
a 64k chunk length.

It was recently discovered that Cassandra was reporting a compression ratio of 
0.  I took a snapshot of the data and started a cassandra node in isolation to 
investigate.

Running nodetool scrub, or nodetool upgradesstables had little impact on the 
amount of data that was being stored.

I then disabled compression and ran nodetool upgradesstables on the column 
family.  Again, no impact on the data size stored.

I then re-enabled compression and ran nodetool upgradesstables on the column 
family.  This resulted in a 60% reduction in the data size stored, and 
Cassandra reporting a compression ratio of about 0.38.
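Roughly, the sequence that finally produced compressed SSTables (column family name hypothetical; the map syntax below is the 1.2-era CQL 3 form, cassandra-cli uses the compression_options attribute instead):

    ALTER TABLE documents WITH compression = { 'sstable_compression' : 'SnappyCompressor',
                                               'chunk_length_kb' : 64 };
    -- then rewrite every SSTable so existing data is actually recompressed:
    -- nodetool upgradesstables <keyspace> documents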

Any idea what is going on here?  Obviously I can go through this process in 
production to enable compression, however, any idea what is currently happening 
and why new data does not appear to be compressed?

Any insights are appreciated,
Thanks,
-Mike

Re: Cassandra Messages Dropped

2012-09-23 Thread Michael Theroux
There were no errors in the log (other than the messages dropped exception 
pasted below), and the node does recover.  We have only a small number of 
secondary indexes (3 in the whole system).

However, I went through the cassandra code, and I believe I've worked through 
this problem.

Just to finish out this thread, I realized that when you see:

INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 
72) FlushWriter   1 5 0

It is an issue.  Cassandra will at various times enqueue many memtables for 
flushing.  By default, the queue size for this is 4.  If more than 5 memtables 
get queued for flushing (4 + 1 for the one currently being flushed), a lock 
will be acquired and held across all tables until all memtables that need to be 
flushed are enqueued.  If it takes more than rpc_timeout_in_ms to 
flush enough information to allow all the pending memtables to be enqueued, 
dropped messages will occur.  To put it in other words, Cassandra will lock down 
all tables until all pending flush requests fit in the pending queue.  If your 
queue size is 4, and 8 tables need to be flushed, Cassandra will lock down all 
tables until a minimum of 3 memtables are flushed.

With this in mind, I went through the cassandra log and found this was indeed 
the case looking at log entries similar to these:

 INFO [OptionalTasks:1] 2012-09-16 05:54:29,750 ColumnFamilyStore.java (line 
643) Enqueuing flush of Memtable-p@1525015234(18686281/341486464 
serialized/live bytes, 29553 ops)
...
INFO [FlushWriter:29] 2012-09-16 05:54:29,768 Memtable.java (line 266) Writing 
Memtable-p@1525015234(18686281/341486464 serialized/live bytes, 29553 ops)
...
INFO [FlushWriter:29] 2012-09-16 05:54:30,254 Memtable.java (line 307) 
Completed flushing /data/cassandra/data/open/people/open-p-hd-441-Data.db

I was able to figure out what the rpc_timeout_in_ms needed to be to temporarily 
prevent the problem.

We had plenty of write I/O available.  We also had free memory.  I increased 
the memtable_flush_writers to 2 and memtable_flush_queue_size to 8.  We 
haven't had any timeouts for a number of days now.
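For reference, the two cassandra.yaml lines in question (the shipped defaults are one flush writer per data directory and a queue size of 4):

    memtable_flush_writers: 2
    memtable_flush_queue_size: 8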

Thanks for your help,
-Mike

On Sep 18, 2012, at 5:14 AM, aaron morton wrote:

 Any errors in the log ?
 
 The node recovers ? 
 
 Do you use secondary indexes ? If so check comments for  
 memtable_flush_queue_size in the yaml. if this value is too low writes may 
 back up. But I would not expect it to cause dropped messages. 
 
 nodetool info also shows we have over a gig of available memory on the JVM 
 heap of each node.
 
 Not all memory is created equal :)
 ParNew is kicking in to GC the Eden space in the New Heap. 
  
 It may just be that the node is getting hammered by something and IO is 
 getting overwhelmed. If you can put the logs up someone might take a look. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 18/09/2012, at 3:46 PM, Michael Theroux mthero...@yahoo.com wrote:
 
 Thanks for the response.
 
 We are on version 1.1.2.  We don't see the MutationStage back up.  The dump 
 from the messages dropped error doesn't show a backup, but also watching 
 nodetool tpstats doesn't show any backup there.
 
 nodetool info also shows we have over a gig of available memory on the JVM 
 heap of each node.
 
 The earliest GCInspector traces I see before one of the more recent 
 incidents in which messages were dropped are:
 
  INFO [ScheduledTasks:1] 2012-09-18 02:25:53,928 GCInspector.java (line 
 122) GC for ParNew: 396 ms for 1 collections, 2064505088 used; max is 
 4253024256
  
  NFO [ScheduledTasks:1] 2012-09-18 02:25:55,929 GCInspector.java (line 
 122) GC for ParNew: 485 ms for 1 collections, 1961875064 used; max is 
 4253024256
  
  INFO [ScheduledTasks:1] 2012-09-18 02:25:57,930 GCInspector.java (line 
 122) GC for ParNew: 265 ms for 1 collections, 1968074096 used; max is 
 4253024256
 
 But this was 45 minutes before messages were dropped.
 
 It's appreciated,
 -Mike
  
 On Sep 17, 2012, at 11:27 PM, aaron morton wrote:
 
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 
 72) MemtablePostFlusher   1 5 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 
 72) FlushWriter   1 5 0
 Looks suspiciously like 
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201209.mbox/%3c9fb0e801-b1ed-41c4-9939-bafbddf15...@thelastpickle.com%3E
 
 What version are you on ? 
 
 Are there any ERROR log messages before this ? 
 
 Are you seeing MutationStage back up ? 
 
 Are you see log messages from GCInspector ?
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 18/09/2012, at 2:16 AM, Michael Theroux mthero...@yahoo.com wrote:
 
 Hello,
 
 While under load, we have occasionally been seeing messages dropped 
 errors

Secondary index loss on node restart

2012-09-23 Thread Michael Theroux
Hello,

We have been noticing an issue where, about 50% of the time in which a node 
fails or is restarted, secondary indexes appear to be partially lost or 
corrupted.  A drop and re-add of the index appears to correct the issue.  There 
are no errors in the cassandra logs that I see.  Part of the index seems to be 
simply missing.  Sometimes this corruption/loss doesn't happen immediately, but 
sometime after the node is restarted.  In addition, the index never appears to 
have an issue when the node comes down; it is only after the node comes back up 
and recovers that we experience an issue.
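Concretely, the workaround is just a drop and re-create of the index (names below are hypothetical):

    DROP INDEX users_by_email;
    CREATE INDEX users_by_email ON users (email);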

We developed some code that goes through all the rows in the table, by key, in 
which the index is present.  It then attempts to look up the information via 
secondary index, in an attempt to detect when the issue occurs.  Another odd 
observation is that the number of members present in the index when we have the 
issue varies up and down (the index and the tables don't change that often).

We are running a 6 node Cassandra cluster with a replication factor of 3, 
consistency level for all queries is LOCAL_QUORUM.  We are running Cassandra 
1.1.2.

Anyone have any insights?

-Mike

Re: Cassandra Messages Dropped

2012-09-23 Thread Michael Theroux
Love the Mars lander analogies :)

On Sep 23, 2012, at 5:39 PM, aaron morton wrote:

 To put in other words, Cassandra will lock down all tables until all pending 
 flush requests fit in the pending queue.
 This was the first issue I looked at in my Cassandra SF talk 
 http://www.datastax.com/events/cassandrasummit2012/presentations
 
 I've seen it occur more often with lots-o-secondary indexes. 
  
 
 We had plenty of write I/O available.  We also had free memory.  I increased 
 the memtable_flush_writers to 2 and memtable_flush_queue_size to 8.  We 
 haven't had any timeouts for a number of days now.
 Cool. 
 
 Cheers
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 24/09/2012, at 6:09 AM, Michael Theroux mthero...@yahoo.com wrote:
 
 There were no errors in the log (other than the messages dropped exception 
 pasted below), and the node does recover.  We have only a small number of 
 secondary indexes (3 in the whole system).
 
 However, I went through the cassandra code, and I believe I've worked 
 through this problem.
 
 Just to finish out this thread, I realized that when you see:
 
  INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 
 72) FlushWriter   1 5 0
 
 It is an issue.  Cassandra will at various times enqueue many memtables for 
 flushing.  By default, the queue size for this is 4.  If more than 5 
 memtables get queued for flushing (4 + 1 for the one currently being 
 flushed), a lock will be acquired and held across all tables until all 
 memtables that need to be flushed are enqueued.  If it takes more than 
 rpc_timeout_time_in_ms time to flush enough information to allow all the 
 pending memtables to be enqueued, a messages dropped will occur.  To put 
 in other words, Cassandra will lock down all tables until all pending flush 
 requests fit in the pending queue.  If your queue size is 4, and 8 tables 
 need to be flushed, Cassandra will lock down all tables until a minimum of 3 
 memtables are flushed.
 
 With this in mind, I went through the cassandra log and found this was 
 indeed the case looking at log entries similar to these:
 
  INFO [OptionalTasks:1] 2012-09-16 05:54:29,750 ColumnFamilyStore.java (line 
 643) Enqueuing flush of Memtable-p@1525015234(18686281/341486464 
 serialized/live bytes, 29553 ops)
 ...
 INFO [FlushWriter:29] 2012-09-16 05:54:29,768 Memtable.java (line 266) 
 Writing Memtable-p@1525015234(18686281/341486464 serialized/live bytes, 
 29553 ops)
 ...
 INFO [FlushWriter:29] 2012-09-16 05:54:30,254 Memtable.java (line 307) 
 Completed flushing /data/cassandra/data/open/people/open-p-hd-441-Data.db
 
 I was able to figure out what the rpc_timeout_in_ms needed to be to 
 temporarily prevent the problem.
 
 We had plenty of write I/O available.  We also had free memory.  I increased 
 the memtable_flush_writers to 2 and memtable_flush_queue_size to 8.  We 
 haven't had any timeouts for a number of days now.
 
 Thanks for your help,
 -Mike
 
 On Sep 18, 2012, at 5:14 AM, aaron morton wrote:
 
 Any errors in the log ?
 
 The node recovers ? 
 
 Do you use secondary indexes ? If so check the comments for 
 memtable_flush_queue_size in the yaml. If this value is too low writes may 
 back up. But I would not expect it to cause dropped messages. 
 
 nodetool info also shows we have over a gig of available memory on the JVM 
 heap of each node.
 
 Not all memory is created equal :)
 ParNew is kicking in to GC the Eden space in the New Heap. 
  
 It may just be that the node is getting hammered by something and IO is 
 getting overwhelmed. If you can put the logs up someone might take a look. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 18/09/2012, at 3:46 PM, Michael Theroux mthero...@yahoo.com wrote:
 
 Thanks for the response.
 
 We are on version 1.1.2.  We don't see the MutationStage back up.  The 
 dump from the messages dropped error doesn't show a backup, but also 
 watching nodetool tpstats doesn't show any backup there.
 
 nodetool info also shows we have over a gig of available memory on the JVM 
 heap of each node.
 
 The earliest GCInspector traces I see before one of the more recent 
 incidents in which messages were dropped are:
 
INFO [ScheduledTasks:1] 2012-09-18 02:25:53,928 GCInspector.java (line 
 122) GC for ParNew: 396 ms for 1 collections, 2064505088 used; max is 
 4253024256
  
   INFO [ScheduledTasks:1] 2012-09-18 02:25:55,929 GCInspector.java (line 
 122) GC for ParNew: 485 ms for 1 collections, 1961875064 used; max is 
 4253024256
  
INFO [ScheduledTasks:1] 2012-09-18 02:25:57,930 GCInspector.java (line 
 122) GC for ParNew: 265 ms for 1 collections, 1968074096 used; max is 
 4253024256
 
 But this was 45 minutes before messages were dropped.
 
 It's appreciated,
 -Mike
  
 On Sep 17, 2012, at 11:27 PM, aaron morton wrote:
 
 INFO [ScheduledTasks:1

Cassandra Messages Dropped

2012-09-17 Thread Michael Theroux
Hello,

While under load, we have occasionally been seeing "messages dropped" errors in 
our cassandra log.  Doing some research, I understand this is part of 
Cassandra's design to shed load, and that we should look at the tpstats-like 
output to determine what should be done to resolve the situation.  Typically, 
you will see lots of messages blocked or pending, and that can indicate that a 
specific piece of hardware needs to be improved/tuned/upgraded.  
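
(The per-node view of those pools comes from nodetool, e.g. something like 
nodetool -h localhost tpstats; the StatusLogger output below reports the same 
set of pools.)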

However, looking at the output we are getting, I'm finding it difficult to see 
what needs to be tuned, as it looks to me like Cassandra is handling the load 
within the mutation stage:

 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,266 MessagingService.java (line 658) 
119 MUTATION messages dropped in last 5000ms
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,645 StatusLogger.java (line 57) 
Pool NameActive   Pending   Blocked
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,836 StatusLogger.java (line 72) 
ReadStage 3 3 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,837 StatusLogger.java (line 72) 
RequestResponseStage  0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,837 StatusLogger.java (line 72) 
ReadRepairStage   0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,837 StatusLogger.java (line 72) 
MutationStage 0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,838 StatusLogger.java (line 72) 
ReplicateOnWriteStage 0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,838 StatusLogger.java (line 72) 
GossipStage   0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) 
AntiEntropyStage  0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) 
MigrationStage0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) 
StreamStage   0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) 
MemtablePostFlusher   1 5 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) 
FlushWriter   1 5 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) 
MiscStage 0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) 
commitlog_archiver0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,841 StatusLogger.java (line 72) 
InternalResponseStage 0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,841 StatusLogger.java (line 72) 
AntiEntropySessions   0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,851 StatusLogger.java (line 72) 
HintedHandoff 0 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,851 StatusLogger.java (line 77) 
CompactionManager 0 0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,852 StatusLogger.java (line 89) 
MessagingService n/a   0,0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,852 StatusLogger.java (line 99) 
Cache Type Size Capacity   
KeysToSave Provider
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 100) 
KeyCache 2184533  2184533
  all 
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 106) 
RowCache  0  0
  all  org.apache.cassandra.cache.SerializingCacheProvider
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 113) 
ColumnFamilyMemtable ops,data
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 116) 
system.NodeIdInfo 0,0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,854 StatusLogger.java (line 116) 
system.IndexInfo  0,0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,854 StatusLogger.java (line 116) 
system.LocationInfo   0,0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,854 StatusLogger.java (line 116) 
system.Versions   0,0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,855 StatusLogger.java (line 116) 
system.schema_keyspaces   0,0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,855 StatusLogger.java (line 116) 
system.Migrations 0,0
 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,855 StatusLogger.java (line 116) 
system.schema_columnfamilies 0,0
 

Cassandra, AWS and EBS Optimized Instances/Provisioned IOPs

2012-09-12 Thread Michael Theroux
Hello,

A number of weeks ago, Amazon announced the availability of EBS Optimized 
instances and Provisioned IOPs for Amazon EC2.  Historically, I've read EBS is 
not recommended for Cassandra due to the network contention that can quickly 
result (http://www.datastax.com/docs/1.0/cluster_architecture/cluster_planning).

Costs aside, and assuming everything promoted by Amazon is accurate, does the 
existence of Provisioned IOPS make EBS a better option than it was before?  
Taking the points against EBS mentioned in the link above:

EBS volumes contend directly for network throughput with standard packets. This 
means that EBS throughput is likely to fail if you saturate a network link.
According to Amazon, Provisioned IOPS is guaranteed to be within 10% of the 
provisioned performance 99.9% of the time.  This would mean that throughput 
should no longer fail.
EBS volumes have unreliable performance. I/O performance can be exceptionally 
slow, causing the system to backload reads and writes until the entire cluster 
becomes unresponsive.
Same point as above.
Adding capacity by increasing the number of EBS volumes per host does not 
scale. You can easily surpass the ability of the system to keep effective 
buffer caches and concurrently serve requests for all of the data it is 
responsible for managing.
I believe this may still be true, although I'm not entirely sure why it would 
be any more true for EBS volumes than for ephemeral storage.

Any real world experience out there with these new EBS options?

-Mike



Re: nodetool repair

2012-07-15 Thread Michael Theroux
So, if I have a 6 node cluster in the token ring, A-B-C-D-E-F, replication 
factor 3, and I run repair (without -pr) on A, is the flow of information:

A synchronizes information it is responsible for with B and C (because B and C 
are replicas of A).
A, as a replica of E and F, synchronizes E and F's information to itself.

If I re-ran repair on E, it would actually be redoing the work of synchronizing 
with A, but would still be doing the worthwhile work of synchronizing with F.

Running repair with -pr would prevent this duplicate work if you ran it on 
each node.  If I ran repair on A with -pr, A synchronizes its primary range 
with B and C, but does not perform synchronization work with E and F.
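
To sanity-check that, here is a quick sketch (hypothetical code, assuming 
SimpleStrategy-style placement where node i's primary range is also replicated 
on the next RF-1 nodes clockwise):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: which primary ranges a repair on node A touches,
// identified by each range's primary owner, for the ring A-B-C-D-E-F with RF=3.
public class RepairRanges {
    static final String[] RING = {"A", "B", "C", "D", "E", "F"};
    static final int RF = 3;

    // Ranges node 'n' holds a replica of: its own range plus the ranges of
    // the RF-1 nodes immediately before it on the ring.
    static List<String> rangesHeldBy(int n) {
        List<String> ranges = new ArrayList<String>();
        for (int back = 0; back < RF; back++) {
            ranges.add(RING[(n - back + RING.length) % RING.length]);
        }
        return ranges;
    }

    public static void main(String[] args) {
        // repair -pr on A: only A's primary range, synced with its replicas (B and C).
        System.out.println("repair -pr on A: range of A only");
        // full repair on A: every range A is a replica for -> [A, F, E]
        System.out.println("full repair on A: ranges of " + rangesHeldBy(0));
    }
}

If that placement assumption holds, it matches the description above: a full 
repair on A also touches E's and F's ranges, while -pr on A leaves those to 
repairs run on E and F.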

Does this sound correct?

-Mike

On Jul 15, 2012, at 9:47 AM, Edward Capriolo wrote:

 Great job sleuthing.
 
 Originally repair did not have a -pr. When you run the standard repair
 the node compares its data with its neighbours and vice versa. They
 also send each other updates. Since you are supposed to repair every
 node within gc_grace, submitting a full repair to each node would create
 duplicated work, since a repair on node A has an effect on node B and
 node C.
 
 If you want to understand this some more you should run
 compactionstats and netstats across your cluster while a repair is
 going on, then you can see what effect the commands have on other
 nodes.
 
 I will try to write up some documentation on it as well because -pr is
 a nice feature. Many may not even be expressly aware of it.
 
 On Sat, Jul 14, 2012 at 2:00 PM, Michael Theroux mthero...@yahoo.com wrote:
 Hello,
 
 I'm looking at nodetool repair with the -pr, vs. non -pr option.  
 Looking around, I'm seeing a lot of conflicting information out there.  
 Almost universally, the recommendation is to run nodetool repair with the 
 -pr for any day-to-day maintenance.
 
 This is my understanding of how it works.  I appreciate any corrections to 
 my misinformation.
 
 nodetool repair -pr
 
 - This performs a repair on the primary range of the node.  The primary 
 range is essentially the part of the ring that the node is responsible for.  
 When this command is run, synchronization of replicas will occur for the 
 rows that this node is responsible for.  If replicas are missing from that 
 node's neighbors for those rows, they will be replicated.
 
 nodetool repair
 
 - This is where I see a lot of conflicting information.  I see a lot of 
 answers suggesting that this command will perform a repair across the entire 
 cluster.  However, I don't believe this is true from my observations (and 
 some of the items I read seem to agree with this).  Instead, this command 
 performs synchronization of your primary range, but also of the other ranges 
 that this node may be responsible for in a replica capacity.  The way I'm 
 thinking about it is that the -pr option causes repairs to push information 
 from the node's primary range to its replicas.  Without -pr, nodetool repair 
 does a push, and also a pull from the neighbors that this node may be a 
 replica for.  This makes sense to me, as people recommend running nodetool 
 repair after a node has been down.  This is to allow the downed node to get 
 any missed information that should have been replicated to it while it was 
 down.
 
 I'm sure there are lots of flaws in the above understanding, as I'm cobbling 
 it together.  I appreciate the feedback,
 
 -Mike



Re: Increased replication factor not evident in CLI

2012-07-15 Thread Michael Theroux
Just to completely eliminate the possibility of the same bug, if you look here:

http://www.mail-archive.com/dev@cassandra.apache.org/msg04992.html

If you create a test keyspace, and look at the timestamp in the 
schema_keyspaces column family in comparison to your existing keyspace, is 
that timestamp greater?
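
(From memory, something along these lines in cassandra-cli should show the 
column timestamps for both keyspaces; the exact syntax may differ slightly by 
version:

use system;
list schema_keyspaces;

Each column in the output should print with a timestamp= field.)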

Thanks,
-Mike

On Jul 12, 2012, at 8:56 PM, Michael Theroux wrote:

 Sounds a lot like a bug that I hit that was filed and fixed recently:
 
 https://issues.apache.org/jira/browse/CASSANDRA-4432
 
 -Mike
 
 On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote:
 
 Possibly the bug with nanotime causing cassandra to think the change 
 happened in the past. Talked about onlist in past few days.
 On Thursday, July 12, 2012, aaron morton aa...@thelastpickle.com wrote:
  Do multiple nodes say the RF is 2 ? Can you show the output from the CLI ? 
  Do show schema and show keyspace say the same thing ?
  Cheers
 
 
  -
  Aaron Morton
  Freelance Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 13/07/2012, at 7:39 AM, Dustin Wenz wrote:
 
  We recently increased the replication factor of a keyspace in our 
  cassandra 1.1.1 cluster from 2 to 4. This was done by setting the 
  replication factor to 4 in cassandra-cli, and then running a repair on 
  each node.
 
  Everything seems to have worked; the commands completed successfully and 
  disk usage increased significantly. However, if I perform a describe on 
  the keyspace, it still shows replication_factor:2. So, it appears that the 
  replication factor might be 4, but it reports as 2. I'm not entirely sure 
  how to confirm one or the other.
 
  Since then, I've stopped and restarted the cluster, and even ran an 
  upgradesstables on each node. The replication factor still doesn't report 
  as I would expect. Am I missing something here?
 
  - .Dustin
 
 
 
 
 
 

 
  
 


Re: Increased replication factor not evident in CLI

2012-07-12 Thread Michael Theroux
Sounds a lot like a bug that I hit that was filed and fixed recently:

https://issues.apache.org/jira/browse/CASSANDRA-4432

-Mike

On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote:

 Possibly the bug with nanotime causing cassandra to think the change happened 
 in the past. Talked about onlist in past few days.
 On Thursday, July 12, 2012, aaron morton aa...@thelastpickle.com wrote:
  Do multiple nodes say the RF is 2 ? Can you show the output from the CLI ? 
  Do show schema and show keyspace say the same thing ?
  Cheers
 
 
  -
  Aaron Morton
  Freelance Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 13/07/2012, at 7:39 AM, Dustin Wenz wrote:
 
  We recently increased the replication factor of a keyspace in our cassandra 
  1.1.1 cluster from 2 to 4. This was done by setting the replication factor 
  to 4 in cassandra-cli, and then running a repair on each node.
 
  Everything seems to have worked; the commands completed successfully and 
  disk usage increased significantly. However, if I perform a describe on the 
  keyspace, it still shows replication_factor:2. So, it appears that the 
  replication factor might be 4, but it reports as 2. I'm not entirely sure 
  how to confirm one or the other.
 
  Since then, I've stopped and restarted the cluster, and even ran an 
  upgradesstables on each node. The replication factor still doesn't report 
  as I would expect. Am I missing something here?
 
  - .Dustin
 
 
 



Re: Serious issue updating Cassandra version and topology

2012-07-10 Thread Michael Theroux
Hello Aaron,

Thank you for responding.  Since the time of my original email, we noticed that 
data was lost in the process of performing this upgrade.  We have restored 
from backup and are now trying this again with two changes:

1) We will be using 1.1.2 throughout the cluster
2) We have switched back to Tiered compaction

In the process I've hit another very interesting issue that I will write a 
separate email about.

However, to answer your questions, this happened on the 1.1.2 node and it 
happened again after we ran the scrub.  The data has been around for a 
while.  We upgraded from 1.0.7 to 1.1.2.

Unfortunately, I can't check the sstables as we've restarted the migration from 
the beginning.  If it happens again, I'll respond with more information.  

Thanks again,
-Mike

On Jul 10, 2012, at 5:05 AM, aaron morton wrote:

 To be clear, this happened on a 1.1.2 node and it happened again *after* you 
 had run a scrub ? 
 
 Has this cluster been around for a while or was the data created with 1.1 ?
 
 Can you confirm that all sstables were re-written for the CF? Check the 
 timestamp on the files. Also, all files should have the same version (the 
 -h?- part of the name).
 
 Can you repair the other CF's ? 
 
 If this cannot be repaired by scrub or upgradesstables you may need to cut 
 the row out of the sstables, using sstable2json and json2sstable. 
 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 8/07/2012, at 4:05 PM, Michael Theroux wrote:
 
 Hello,
 
 We're in the process of trying to move a 6-node cluster from RF=1 to RF=3. 
 Once our replication factor was upped to 3, we ran nodetool repair, and 
 immediately hit an issue on the first node we ran repair on:
 
 INFO 03:08:51,536 Starting repair command #1, repairing 2 ranges.
 INFO 03:08:51,552 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] new 
 session: will sync xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101, 
 /10.29.187.61 on range 
 (Token(bytes[d558]),Token(bytes[])]
  for x.[a, b, c, d, e, f, g, h, i, 
 j, k, l, m, n, o, p, q, r, s]
 INFO 03:08:51,555 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting 
 merkle trees for a (to [/10.29.187.61, 
 xxx-xx-xx-xxx-compute-1.amazonaws.com/10.202.99.101])
 INFO 03:08:52,719 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received 
 merkle tree for a from /10.29.187.61
 INFO 03:08:53,518 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received 
 merkle tree for a from 
 xxx-xx-xx-xxx-.compute-1.amazonaws.com/10.202.99.101
 INFO 03:08:53,519 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting 
 merkle trees for b (to [/10.29.187.61, 
 xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101])
 INFO 03:08:53,639 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Endpoints 
 /10.29.187.61 and xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101 
 are consistent for a
 INFO 03:08:53,640 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] a is 
 fully synced (18 remaining column family to sync for this session)
 INFO 03:08:54,049 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received 
 merkle tree for b from /10.29.187.61
 ERROR 03:09:09,440 Exception in thread Thread[ValidationExecutor:1,1,main]
 java.lang.AssertionError: row 
 DecoratedKey(Token(bytes[efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47]),
  efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47) received 
 out of order wrt 
 DecoratedKey(Token(bytes[f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb]),
  f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb)
  at 
 org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:349)
  at 
 org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:712)
  at 
 org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:68)
  at 
 org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:438)
  at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
  at java.util.concurrent.FutureTask.run(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
 Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)
 
 It looks, from the log above, like the sync of the a column family was 
 successful.  However, the b column family resulted in this error.  In 
 addition, the repair hung after this error.  We ran nodetool scrub on all 
 nodes and invalidated the key and row caches and tried again (with RF=2), 
 and it didn't help alleviate the problem.
 
 Some other important pieces of information:
 We use ByteOrderedPartitioner (we MD5 hash

Re: Expanding Cassandra on EC2 with consistency

2012-07-03 Thread Michael Theroux
Yes, I saw LOCAL_QUORUM.  The definition I saw was:

Ensure that the write has been written to ReplicationFactor / 2 + 1 nodes 
within the local datacenter (requires NetworkTopologyStrategy).

This will allow a quorum within a datacenter.  However, I think this means 
that if availability zones are racks, the quorum would still be spread across 
availability zones.
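
A quick sketch of the arithmetic (hypothetical snippet, just integer division):

public class QuorumMath {
    public static void main(String[] args) {
        int rf = 3;
        int quorum = rf / 2 + 1;  // integer division: 3 / 2 + 1 = 2
        System.out.println("QUORUM/LOCAL_QUORUM for RF=" + rf
                + " requires " + quorum + " replica acks");
    }
}

So with RF=3, two replicas must ack, and if the snitch has spread the three 
replicas across three availability zones, the second ack necessarily comes from 
a different zone.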

-Mike

On Jul 3, 2012, at 9:22 AM, Robin Verlangen wrote:

 Hi Mike,
 
 I'm not sure about all your questions, however you should take a look at 
 LOCAL_QUORUM for your question about consistency level reads/writes.
 
 2012/7/3 Michael Theroux mthero...@yahoo.com
 Hello,
 
 We are currently running a web application utilizing Cassandra on EC2.  Given 
 the recent outages experienced with Amazon, we want to consider expanding 
 Cassandra across availability zones sooner rather than later.
 
We are trying to determine the optimal way to deploy Cassandra in this 
environment.  We are researching the NetworkTopologyStrategy and the 
EC2Snitch.  We are also interested in providing a high level of read or write 
consistency.
 
My understanding is that the EC2Snitch recognizes availability zones as 
racks, and regions as data centers.  This seems to be a common configuration. 
However, if we wanted to use queries with a READ or WRITE consistency of 
QUORUM, would there be a high possibility that the communication necessary to 
establish a quorum would have to cross availability zones?
 
My understanding is that the NetworkTopologyStrategy prefers to store 
replicas on other racks within the datacenter, which would equate to other 
availability zones in EC2.  This implies to me that, in order to have the 
quorum of nodes necessary to achieve consistency, Cassandra will communicate 
with nodes across availability zones.
 
First, is my understanding correct?  Second, given the high latency that can 
sometimes exist between availability zones, is this a problem?  Should we 
instead treat availability zones as data centers?
 
Ideally, we would be able to set up a situation where we could store replicas 
across availability zones in case of failure, but establish a high level of 
read or write consistency within a single availability zone.
 
 I appreciate your responses,
 Thanks,
 -Mike
 
 
 
 
 
 
 -- 
 With kind regards,
 
 Robin Verlangen
 Software engineer
 
 W http://www.robinverlangen.nl
 E ro...@us2.nl
 
 Disclaimer: The information contained in this message and attachments is 
 intended solely for the attention and use of the named addressee and may be 
 confidential. If you are not the intended recipient, you are reminded that 
 the information remains the property of the sender. You must not use, 
 disclose, distribute, copy, print or rely on this e-mail. If you have 
 received this message in error, please contact the sender immediately and 
 irrevocably delete this message and any copies.