[ 
https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304281#comment-15304281
 ] 

Paulo Motta commented on CASSANDRA-8523:
----------------------------------------

The initial route I will take is to create a new dead state for replace that 
adds the node to TokenMetadata as a replacing endpoint. When a replacement 
endpoint is added to TokenMetadata and there is an existing down node with the 
same IP, we remove that node from the natural endpoints, so the replacement 
node does not receive reads once it becomes alive in the failure detector (FD), 
and we add the replacement node as a pending joining endpoint, so writes are 
forwarded to it. After the pending ranges for the replace are calculated, the 
final step is to mark the node as alive in the FD and to change the dead state 
logic so that it no longer marks "alive" dead-state nodes as DOWN automatically 
(we will probably rename "dead state" to something better, like "invisible"). 
The FD will then work as usual: if the replacing endpoint goes down, it is 
removed from TokenMetadata and the endpoint is restored as a natural endpoint. 
The current streaming logic will probably be unaffected, and the node will 
change its state to NORMAL once streaming finishes, completing the replace 
procedure.
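
To make the flow above concrete, here is a toy model of the endpoint 
bookkeeping, as a minimal sketch only. ToyRing and all of its method names are 
hypothetical and are not Cassandra's TokenMetadata or failure detector API; the 
model also ignores tokens and ranges and treats the ring as a flat set of 
addresses.

```python
class ToyRing:
    """Hypothetical stand-in for ring state; not Cassandra's TokenMetadata."""

    def __init__(self, natural_endpoints):
        self.natural = set(natural_endpoints)  # endpoints that own data (reads + writes)
        self.pending = set()                   # endpoints that only receive writes
        self.alive = set(natural_endpoints)    # what the failure detector considers up

    def mark_down(self, endpoint):
        """The failure detector convicts an ordinary node."""
        self.alive.discard(endpoint)

    def add_replacing_endpoint(self, endpoint):
        """Start a replace: the new node reuses the address of a down node."""
        if endpoint not in self.natural or endpoint in self.alive:
            raise ValueError("replace needs an existing down node with the same address")
        self.natural.discard(endpoint)  # no reads for the replacement
        self.pending.add(endpoint)      # but writes are forwarded to it
        self.alive.add(endpoint)        # final step: mark it alive in the FD

    def on_replacement_down(self, endpoint):
        """Failed replace: drop the pending entry and restore the natural endpoint."""
        self.alive.discard(endpoint)
        if endpoint in self.pending:
            self.pending.discard(endpoint)
            self.natural.add(endpoint)

    def finish_replace(self, endpoint):
        """Streaming finished: the node becomes NORMAL and serves reads again."""
        if endpoint in self.pending:
            self.pending.discard(endpoint)
            self.natural.add(endpoint)

    def read_targets(self):
        return sorted(self.natural & self.alive)

    def write_targets(self):
        return sorted((self.natural | self.pending) & self.alive)


ring = ToyRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
ring.mark_down("10.0.0.2")
ring.add_replacing_endpoint("10.0.0.2")
print(ring.read_targets())   # the replacement is excluded from reads
print(ring.write_targets())  # but included in writes
```

In this sketch, calling on_replacement_down during streaming puts the address 
back in the natural set as a down node, which mirrors the "restore it as a 
natural endpoint" step; calling finish_replace instead corresponds to the 
transition to NORMAL.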

The only downside I can think of with this approach is that we will lose hints 
during a failed replace, but this is not a big deal, since hints are an 
optimization and a replace will probably take longer than max_hint_window 
anyway.

I will start down this route; feel free to give feedback or let me know if 
I'm missing something in this high-level flow.

> Writes should be sent to a replacement node while it is streaming in data
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8523
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8523
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Wagner
>            Assignee: Paulo Motta
>             Fix For: 2.1.x
>
>
> In our operations, we make heavy use of replace_address (or 
> replace_address_first_boot) in order to replace broken nodes. We now realize 
> that writes are not sent to the replacement nodes while they are in hibernate 
> state and streaming in data. This runs counter to what our expectations were, 
> especially since we know that writes ARE sent to nodes when they are 
> bootstrapped into the ring.
> It seems like Cassandra should arrange to send writes to a node that is in 
> the process of replacing another node, just like it does for nodes that are 
> bootstrapping. I hesitate to phrase this as "we should send writes to a node 
> in hibernate" because the concept of hibernate may be useful in other 
> contexts, as per CASSANDRA-8336. Maybe a new state is needed here?
> Among other things, the fact that we don't get writes during this period 
> makes subsequent repairs more expensive, proportional to the number of writes 
> that we miss (and depending on the amount of data that needs to be streamed 
> during replacement and the time it may take to rebuild secondary indexes, we 
> could miss many hours' worth of writes). It also leaves us more exposed 
> to consistency violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
