[ https://issues.apache.org/jira/browse/CASSANDRA-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211202#comment-15211202 ]

Paulo Motta commented on CASSANDRA-9244:
----------------------------------------

What we want here is to restore replica count on other nodes during a node 
replacement, since replica placement might change on other nodes when the new 
node is introduced.

For example, imagine nodes A, B, C from racks R1, R1 and R2 respectively, with RF=2. A 
token X that maps to node A must be replicated to A(R1) and C(R2). If node A is 
replaced with node D from rack R2, the same token X should now be replicated to 
D(R2) and B(R1). So the secondary replica placement of node A changed from node 
C to node B, and thus node C should stream part of its data to B while D is 
replacing A.
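
To make the example concrete, here is a toy sketch of the placement change. This is NOT the actual {{NetworkTopologyStrategy}} code, just a simplified rack-aware walk of the ring with RF=2 and a single DC:

{code:java}
import java.util.*;

// Toy model only: walk the ring clockwise from the primary and prefer racks
// we have not used yet. Real placement lives in NetworkTopologyStrategy.
public class ReplicaPlacementDemo
{
    static class Node
    {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        public String toString() { return name + "(" + rack + ")"; }
    }

    static List<Node> replicasFor(List<Node> ring, int primaryIndex, int rf)
    {
        List<Node> replicas = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        for (int i = 0; i < ring.size() && replicas.size() < rf; i++)
        {
            Node n = ring.get((primaryIndex + i) % ring.size());
            if (usedRacks.add(n.rack))
                replicas.add(n);
        }
        return replicas;
    }

    public static void main(String[] args)
    {
        // Before the replace: A(R1) owns token X, B(R1) is skipped (same rack), C(R2) is the second replica.
        List<Node> before = Arrays.asList(new Node("A", "R1"), new Node("B", "R1"), new Node("C", "R2"));
        System.out.println("before: " + replicasFor(before, 0, 2));  // [A(R1), C(R2)]

        // After D(R2) takes over A's tokens: replicas for X become D(R2) and B(R1),
        // so C must hand its copy of that range over to B.
        List<Node> after = Arrays.asList(new Node("D", "R2"), new Node("B", "R1"), new Node("C", "R2"));
        System.out.println("after:  " + replicasFor(after, 0, 2));   // [D(R2), B(R1)]
    }
}
{code}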

To solve this we basically need to tell other nodes that we are replacing and 
ask them to restore replica count, similar to what is done when {{nodetool 
removenode}} is called. A simple way to do this would be to add a 
{{RESTORE_REPLICA_COUNT}} message that the replacing node would send to the 
nodes whose replica placement is affected by the replacement. While this would 
solve part of the problem, ongoing writes during the replacement process would 
not be redirected to the new replicas a la CASSANDRA-8523, reinforcing the need 
for a repair after a replace.
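
For reference, a rough sketch of what the receiving side of such a {{RESTORE_REPLICA_COUNT}} notification could look like, modeled on the {{restoreReplicaCount}} path that {{nodetool removenode}} already uses. The {{getChangedRangesForReplace}} helper and the handler name are hypothetical; the rest mirrors the existing removenode code:

{code:java}
// Sketch only: invoked on a node notified by the replacing node. Computes the
// ranges this node gains because the old node's rack is being swapped for the
// replacing node's rack, then fetches them from their current replicas.
private void handleRestoreReplicaCountForReplace(InetAddress oldNode, InetAddress replacingNode)
{
    InetAddress myAddress = FBUtilities.getBroadcastAddress();
    StreamPlan stream = new StreamPlan("Restore replica count (replace)");

    for (String keyspaceName : Schema.instance.getNonSystemKeyspaces())
    {
        // Hypothetical helper, analogous to getChangedRangesForLeaving():
        // which ranges get a new replica once replacingNode's topology is used?
        Multimap<Range<Token>, InetAddress> changedRanges =
            getChangedRangesForReplace(keyspaceName, oldNode, replacingNode);

        Set<Range<Token>> myNewRanges = new HashSet<>();
        for (Map.Entry<Range<Token>, InetAddress> entry : changedRanges.entries())
            if (entry.getValue().equals(myAddress))
                myNewRanges.add(entry.getKey());

        // Pick a live source for each newly gained range and request it,
        // exactly as restoreReplicaCount() does for removenode.
        Multimap<InetAddress, Range<Token>> sourceRanges = getNewSourceRanges(keyspaceName, myNewRanges);
        for (Map.Entry<InetAddress, Collection<Range<Token>>> entry : sourceRanges.asMap().entrySet())
            stream.requestRanges(entry.getKey(), SystemKeyspace.getPreferredIP(entry.getKey()),
                                 keyspaceName, entry.getValue());
    }
    stream.execute();
}
{code}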

I implemented an initial solution (available on this 
[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:9244])
 that reworks the replace process by adding a new gossip state BOOT_REPLACE, 
which is similar to BOOT but calculates pending ranges and restores replica 
counts on other nodes during the replace, solving both this issue and CASSANDRA-8523.
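
To give an idea of the shape (this is NOT the code on the branch above; the {{BOOT_REPLACE}} literal and {{handleStateBootReplacing}} are placeholder names), the new status would be dispatched on other nodes next to the existing BOOT/NORMAL/LEAVING handling in {{StorageService.onChange}}:

{code:java}
// Sketch only, not the branch code.
public void onChange(InetAddress endpoint, ApplicationState state, VersionedValue value)
{
    if (state != ApplicationState.STATUS)
        return;  // the real onChange() also handles other application states

    String moveName = value.value.split(VersionedValue.DELIMITER_STR, -1)[0];
    if (moveName.equals("BOOT_REPLACE"))
    {
        // Like handleStateBootstrap(): record the endpoint's tokens as pending
        // so ongoing writes also reach the replacing node (CASSANDRA-8523), and
        // ask the nodes whose replica placement changes to restore their
        // replica count, as sketched above.
        handleStateBootReplacing(endpoint);
    }
}
{code}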

While the solution works beautifully when {{replace_address != 
broadcast_address}} (which I stupidly assumed would be the only case, given my 
previous EC2 background), the problem arises when {{replace_address == 
broadcast_address}}: other nodes think the old node is back in town and start 
sending reads to it, since the address is still a natural endpoint (I'd 
probably have figured this out earlier if I had talked to [~brandon.williams] 
before, but well).

Two ways I can think of to solve this would be:
1. keep the node as a natural endpoint but do not send reads to it, by checking 
whether it is in {{NORMAL}} state before contacting it, for instance in the 
{{StorageService.getLiveNaturalEndpoints}} method (see the sketch after this list).
2. remove the node from the natural endpoint list, but then we would need to 
restore the ring if the replace fails, deal with hints, clients would probably 
get confused, etc.
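
A rough, untested sketch of option 1, filtering out non-{{NORMAL}} natural endpoints in {{StorageService.getLiveNaturalEndpoints}} (the {{isNormal}} helper below is something we would need to add, it is not existing API):

{code:java}
// Sketch of option 1: in addition to liveness, skip a natural endpoint whose
// gossip status is not NORMAL, so the address being replaced stops receiving
// reads while the replacement is in flight.
public List<InetAddress> getLiveNaturalEndpoints(Keyspace keyspace, RingPosition pos)
{
    List<InetAddress> endpoints = keyspace.getReplicationStrategy().getNaturalEndpoints(pos);
    List<InetAddress> liveEps = new ArrayList<>(endpoints.size());
    for (InetAddress endpoint : endpoints)
    {
        if (FailureDetector.instance.isAlive(endpoint) && isNormal(endpoint))
            liveEps.add(endpoint);
    }
    return liveEps;
}

// Hypothetical helper: check the endpoint's STATUS application state in gossip.
private static boolean isNormal(InetAddress endpoint)
{
    EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(endpoint);
    if (epState == null)
        return false;
    VersionedValue status = epState.getApplicationState(ApplicationState.STATUS);
    return status != null && status.value.startsWith(VersionedValue.STATUS_NORMAL);
}
{code}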

I'm leaning more towards 1, but I still think both of them look a bit 
workaroundish, error-prone and probably not ideal.

After this I realized that this would be much easier to deal with if we indexed 
nodes in {{TokenMetadata}} by {{UUID}} rather than by {{InetAddress}}, as we 
currently do, probably for historical reasons. This would allow, for instance, 
having a DOWN natural endpoint with {{UUID=X}} and an UP pending/replacing 
endpoint with {{UUID=Y}}, even though they have the same {{InetAddress}} and 
tokens. However, for this to be done right we would probably need to treat 
endpoints by UUID in other places as well (including the FD, gossiper, 
StorageProxy, etc.) and deal with {{InetAddress}} mostly in the messaging 
service and similar network entry points.
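
Purely as an illustration of what {{UUID}}-keyed ring state would enable (simplified types, not a proposed patch):

{code:java}
import java.net.InetAddress;
import java.util.*;

// With UUID-keyed ring state, a DOWN natural endpoint and an UP replacing
// endpoint can coexist even though they share the same InetAddress and tokens.
public class UuidKeyedRingSketch
{
    enum State { NORMAL, BOOT_REPLACE }

    static class NodeEntry
    {
        final UUID hostId;
        final InetAddress address;       // no longer the map key
        final Collection<Long> tokens;   // simplified token type
        final State state;
        final boolean up;

        NodeEntry(UUID hostId, InetAddress address, Collection<Long> tokens, State state, boolean up)
        {
            this.hostId = hostId; this.address = address; this.tokens = tokens;
            this.state = state; this.up = up;
        }
    }

    public static void main(String[] args) throws Exception
    {
        Map<UUID, NodeEntry> ring = new HashMap<>();
        InetAddress sharedAddress = InetAddress.getByName("10.0.0.1");
        Collection<Long> tokens = Collections.singleton(42L);

        UUID oldId = UUID.randomUUID();
        UUID newId = UUID.randomUUID();
        // The node being replaced: still a natural endpoint, but DOWN.
        ring.put(oldId, new NodeEntry(oldId, sharedAddress, tokens, State.NORMAL, false));
        // The replacing node: same address and tokens, but pending/replacing and UP.
        ring.put(newId, new NodeEntry(newId, sharedAddress, tokens, State.BOOT_REPLACE, true));

        // Reads only go to UP NORMAL entries; pending writes can also target the
        // BOOT_REPLACE entry. An InetAddress-keyed map cannot hold both entries.
        for (NodeEntry n : ring.values())
            if (n.up && n.state == State.NORMAL)
                System.out.println("read target: " + n.hostId);
    }
}
{code}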

While the effort would be relatively big, I think it would be mostly mechanical 
and would mostly involve internal code changes, not affecting communication, 
storage or upgrades. Besides allowing us to implement replace correctly once 
and for all, this would probably make other parts of the code simpler as well 
(such as the {{prefer_local}} logic). But I'm still not sure whether this would 
be totally doable and worth the effort. Perhaps we could do it as a first shot 
at CASSANDRA-6061 and target it for 4.0? WDYT [~brandon.williams], 
[~thobbs], [~jkni]?

> replace_address is not topology-aware
> -------------------------------------
>
>                 Key: CASSANDRA-9244
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9244
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata, Streaming and Messaging
>         Environment: 2.0.12
>            Reporter: Rick Branson
>            Assignee: Paulo Motta
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>
> Replaced a node with one in another rack (using replace_address) and it 
> caused improper distribution after the bootstrap was finished. It looks like 
> the ranges for the streams are not created in a way that is topology-aware. 
> This should probably either be prevented, or ideally, would work properly. 
> The use case is migrating several nodes from one rack to another.


