Thanks.  Unfortunately, we lost our system logs during all of this (we had
the normal logs, but not the system logs) due to an unrelated issue :/

Anyhow, as far as I can tell, we're doing okay.

On Thu, Oct 20, 2016 at 11:18 PM, Jeremiah D Jordan <
jeremiah.jor...@gmail.com> wrote:

> The easiest way to figure out what happened is to examine the system log;
> it will tell you exactly what happened.  But I’m pretty sure your nodes got
> new tokens during that time.
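>
> For what it's worth, you can also check a node's current identity directly.
> Something like this (rough sketch; run it on the node and substitute your
> own addresses) compares what the node has persisted locally with what the
> rest of the cluster sees:
>
>     # Host ID and tokens Cassandra has stored locally on this node:
>     cqlsh -e "SELECT host_id, tokens FROM system.local;"
>
>     # Host IDs and ownership as seen by the rest of the cluster:
>     nodetool status
>
>     # All tokens currently owned by this node's address:
>     nodetool ring | grep <node_ip>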
>
> If you want to get back the data inserted during those 2 hours, you could
> use sstableloader to send all the data from the 
> /var/data/cassandra_new/cassandra/*
> folders back into the cluster, if you still have it.
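>
> Roughly something like this, run once per table directory (just a sketch;
> the data/ subpath and the target host are assumptions about your layout,
> so adjust them to match it):
>
>     # Stream the orphaned sstables back into the live cluster.
>     # <live_node_ip> is any reachable node in the cluster.
>     sstableloader -d <live_node_ip> \
>         /var/data/cassandra_new/cassandra/data/<keyspace>/<table>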
>
> -Jeremiah
>
>
>
> On Oct 20, 2016, at 3:58 PM, Branton Davis <branton.da...@spanning.com>
> wrote:
>
> Howdy folks.  I asked a bit about this in IRC yesterday, but we're hoping
> to confirm a couple of things for our sanity.
>
> Yesterday, I was performing an operation on a 21-node cluster (vnodes,
> replication factor 3, NetworkTopologyStrategy, and the nodes are balanced
> across 3 AZs on AWS EC2).  The plan was to swap each node's existing 1TB
> volume (where all cassandra data, including the commitlog, is stored) with
> a 2TB volume.  The plan for each node (one at a time) was basically the
> following (see the command sketch just after the list):
>
>    - rsync while the node is live (repeated until there were only minor
>    differences from new data)
>    - stop cassandra on the node
>    - rsync again
>    - replace the old volume with the new
>    - start cassandra
>
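> For reference, the per-node sequence looked roughly like the following
> (illustrative commands, not verbatim what was run; mount points and
> service commands will vary):
>
>     # 1. Warm copy while the node is still live; repeat until the delta
>     #    from new data is small.  Note the trailing slash on the source.
>     rsync -a --delete /var/data/cassandra/ /var/data/cassandra_new/
>
>     # 2. Stop cassandra and do a final catch-up copy.
>     sudo service cassandra stop
>     rsync -a --delete /var/data/cassandra/ /var/data/cassandra_new/
>
>     # 3. Remount the new 2TB volume at /var/data/cassandra (device names
>     #    depend on the instance), then bring cassandra back up.
>     sudo service cassandra start
>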
> However, there was a bug in the rsync command.  Instead of copying the
> contents of /var/data/cassandra to /var/data/cassandra_new, it copied them
> to /var/data/cassandra_new/cassandra.  So, when cassandra was started after
> the volume swap, there was some behavior that was similar to bootstrapping
> a new node (data started streaming in from other nodes).  But there
> was also some behavior that was similar to a node replacement (nodetool
> status showed the same IP address, but a different host ID).  This
> happened with 3 nodes (one from each AZ).  The nodes had received 1.4GB,
> 1.2GB, and 0.6GB of data (whereas the normal load for a node is around
> 500-600GB).
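>
> (In case it helps anyone else, the difference comes down to rsync's
> trailing-slash rule; with the same paths, illustratively:)
>
>     # Intended: trailing slash on the source copies the *contents* of
>     # cassandra/ into cassandra_new/.
>     rsync -a /var/data/cassandra/ /var/data/cassandra_new/
>
>     # What actually ran: no trailing slash, so the directory itself is
>     # copied and the data lands in /var/data/cassandra_new/cassandra.
>     rsync -a /var/data/cassandra /var/data/cassandra_new/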
>
> The cluster was in this state for about 2 hours, at which point cassandra
> was stopped on those three nodes.  Later, I moved the data from the
> original volumes back into place (so, it should have been back in its
> original, pre-operation state) and started cassandra back up.
>
> Finally, the questions.  We've accepted the potential loss of new data
> written within those two hours, but our primary concern now is what was
> happening with the bootstrapping nodes.  Would they have taken on the token
> ranges of the original nodes, or acted like new nodes and gotten new token
> ranges?  If the latter, is it possible that any data moved from the healthy
> nodes to the "new" nodes, or would restarting them with the original data
> (and repairing) put the cluster's token ranges back into a normal state?
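>
> Concretely, once the original data is back in place, I'm picturing
> something like the following to verify and then repair (sketch only;
> <keyspace> is a placeholder):
>
>     # Confirm host IDs and ownership look normal again across the cluster:
>     nodetool status <keyspace>
>
>     # Then repair each of the three affected nodes:
>     nodetool repair <keyspace>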
>
> Hopefully that was all clear.  Thanks in advance for any info!
>
>
>
