The exception you run into is expected behavior. As Ben pointed out, when you delete everything (including the system schemas), the C* cluster thinks you're bootstrapping a brand-new node. However, node2's old IP is still in gossip, and that is why you see the exception.
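To clear that stale gossip entry before the wiped node rejoins, the usual tool is `nodetool removenode`. A minimal sketch, run from any live node (the UUID below is a placeholder; use the Host ID that `nodetool status` reports for the DN node):

```shell
# Find the Host ID of the down node (shown with status "DN")
nodetool status

# Remove the dead node's old identity from the ring and from gossip.
# The UUID here is hypothetical; substitute the real Host ID.
nodetool removenode 05bdb1d4-c39b-48f1-8248-911d61935925

# If removenode stalls waiting on unreachable replicas, it can be forced:
nodetool removenode force
```

Note that `removenode` streams the removed node's data to the remaining replicas, so it can take a while on large data sets.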
I'm not clear on the reasoning for deleting the C* data directory. That is a dangerous action, especially considering that you deleted the system schemas. If the failed node has been down for a while, what you need to do is remove the node first before doing the "rejoin".

Cheers,
Yabin

On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <ben.sla...@instaclustr.com> wrote:

> To Cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn't automatically rebuild machines, to prevent accidental
> replacement. You need to tell it to build the "new" machine as a
> replacement for the "old" machine with that IP by setting
> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
> http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>
> Cheers
> Ben
>
> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <y...@imagine-orb.com> wrote:
>
>> Hi all,
>>
>> A failed node can rejoin a cluster even though all data in
>> /var/lib/cassandra was deleted on that node.
>> Is that normal?
>>
>> I can reproduce it as below.
>>
>> cluster:
>> - C* 2.2.7
>> - the cluster has node1, node2, node3
>> - node1 is a seed
>> - replication_factor: 3
>>
>> how to reproduce:
>> 1) stop the C* process and delete all data in /var/lib/cassandra on node2
>>    ($ sudo rm -rf /var/lib/cassandra/*)
>> 2) stop the C* process on node1 and node3
>> 3) restart C* on node1
>> 4) restart C* on node2
>>
>> nodetool status after 4):
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
>> DN  [node3 IP]  ?          256     100.0%            325553c6-3e05-41f6-a1f7-47436743816f  rack1
>> UN  [node2 IP]  7.76 MB    256     100.0%            05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>> UN  [node1 IP]  416.13 MB  256     100.0%            a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>
>> If I restart C* on node2 while C* on node1 and node3 are running (i.e.
>> without steps 2) and 3)), a runtime exception happens:
>>
>> RuntimeException: "A node with address [node2 IP] already exists,
>> cancelling join..."
>>
>> I'm not sure whether this causes data loss. All data can be read properly
>> just after this rejoin, but some rows are lost when I kill and restart C*
>> for destructive tests after this rejoin.
>>
>> Thanks.
>>
>> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
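Ben's replacement procedure can be sketched as a shell session on the node being rebuilt. This is an assumption-laden sketch: `192.0.2.12` stands in for the dead node's own IP, and the env-file path and service name (`/etc/cassandra/cassandra-env.sh`, `service cassandra`) vary by install; adjust for your packaging.

```shell
# On the replacement node, with Cassandra stopped and empty data directories
# (data, commitlog, saved_caches):
sudo rm -rf /var/lib/cassandra/*

# Tell this node to bootstrap as a replacement for the dead node at that IP.
# (Path is an assumption; some installs use jvm.options or a wrapper script.)
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=192.0.2.12"' \
  | sudo tee -a /etc/cassandra/cassandra-env.sh

# Start the node; it streams the data for the old node's token ranges.
sudo service cassandra start
```

Once the node shows as UN in `nodetool status`, remove the flag again so later restarts don't attempt another replacement (`replace_address_first_boot`, unlike plain `replace_address`, is only honored on the node's first boot, but leaving stale flags around invites confusion).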