[ 
https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932917#action_12932917
 ] 

Sylvain Lebresne commented on CASSANDRA-1072:
---------------------------------------------

Thanks for the answer. I haven't had the time to have a look at the new patch, 
sorry, but I will as soon as time permits. A few answers to your answers, though.

{quote}
1. Is this the kind of IP address situation you are referring to?  A cluster of 
nodes: A (127.0.0.1), B (127.0.0.2), and C (127.0.0.3) have been running and 
are not fully consistent. They're brought back up w/ shuffled ips, like so: A 
(127.0.0.2), B (127.0.0.3), and C (127.0.0.1). A has the most up-to-date view 
of writes to 127.0.0.1, however, C is now in-charge of writes to 127.0.0.1. 
i.e. any writes to A that C had not seen, previously, have now been lost.
{quote}

That's one scenario, but I think you'd actually be very lucky if, in such a 
scenario, you only "lose a few non-replicated updates". There is much (much) 
worse.

Suppose you have your 3-node cluster (and say RF=2 or 3). Node A accepts one or 
more counter updates and its part of the counter is, say, 10. This value 10 is 
replicated to B (as part of "repair on write" or read repair). On B, the 
memtable is flushed, so this value 10 ends up in one of B's sstables. Now A 
accepts more update(s), bringing the value to, say, 15. Again, this value 15 is 
replicated. At this point, the cluster is coherent and the value for the counter 
is 15. But say the cluster is somehow shut down, there is some IP mixup, and B 
is restarted with the IP that A had before. Now, any read (on B) will reconcile 
the two values 10 and 15 by merging them (because B now believes these are 
updates it has accepted itself, and as such are deltas, while they are not) and 
yield 25. Very quickly, replication will pollute every other node in the cluster 
with this bogus value, and compaction will make it permanent.
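
To make the failure mode concrete, here is a rough sketch of such a reconcile. 
This is illustrative only, not the actual CounterContext code; the names 
(CounterShard, NaiveReconciler) are made up for the example:

{code:java}
// Illustrative sketch only, NOT the real implementation. A "shard" carries the
// running total attributed to one node. If a node mistakes replicated totals
// for its own local deltas, it sums them instead of keeping the latest one.
import java.net.InetAddress;

final class CounterShard
{
    final InetAddress owner; // node this part of the counter is attributed to
    final long value;        // running total for that node's part

    CounterShard(InetAddress owner, long value)
    {
        this.owner = owner;
        this.value = value;
    }
}

final class NaiveReconciler
{
    private final InetAddress localAddress;

    NaiveReconciler(InetAddress localAddress)
    {
        this.localAddress = localAddress;
    }

    // Merge two versions of the same shard found in different sstables.
    long reconcile(CounterShard a, CounterShard b)
    {
        if (a.owner.equals(localAddress) && b.owner.equals(localAddress))
            // "my own deltas" -> sum them; only correct if they really are deltas
            return a.value + b.value;
        // replicated totals -> keep the most recent one (timestamp check elided)
        return Math.max(a.value, b.value);
    }
}
{code}

After B restarts with A's old IP, both sstable versions of A's shard (10 and 15) 
look local to B, so a reconcile like this returns 25 instead of 15.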

Potentially, any change of a node's IP to one that has been used by another 
node at some point (even a decommissioned one) can be harmful (and dramatically 
so), unless you know that everything has been compacted nice and clean.

So while I agree that such IP changes are not supposed to be the norm, they 
can, and so they will, happen (even in test environments, where one could be 
less prudent and thus such a scenario is even more likely to happen; it will 
piss people off real bad).

I'm strongly opposed (and always will be) to any change to Cassandra that will 
destroy data because someone on the ops team has messed up and hit the enter 
key a bit too quickly. But that's just my humble opinion, and it's open source, 
so anybody else, please chime in and give yours.

{quote}
A fix with UUIDs is possible but it's beyond the scope of this jira.
{quote}

Because of what's above, I disagree with this. Even more so because I'm not at 
all convinced that this could be easily fixed afterwards.

{quote}
2. Valid issue, but it does sound like something of an edge case. For a first 
version of 1072 it seems reasonable that instructions for ops would be 
sufficient for this problem. If the community then still feels it's a problem 
we can look at how to improve the code.
{quote}

Not sure that's an edge case. Right now, when a node is bootstrapped, repair is 
not run automatically at the end of the bootstrap, in part because failures 
tend to happen quickly. Thus a good piece of advice is to wait a bit to make 
sure the new node behaves alright before running repair on the other nodes, so 
you have a quick rollback if the new node doesn't behave correctly. Bootstrap 
followed by decommission seems to me bound to happen from time to time (if 
someone feels like confirming/denying?). That repair has not been run when this 
happens doesn't seem a crazy scenario at all either. And anyway, as for 1, the 
risk is corrupting data (for the same reason: a node will merge values that are 
not deltas). I don't consider "telling people to be careful" a fix. And because 
I don't think fixing this will be easy, I'm not comfortable with leaving it for 
later.

More generally, the counter design is based on some values being merged 
(summed) together (deltas) and others being reconciled as usual based on 
timestamps. This is a double-edged sword: it allows for quite nice performance 
properties, but it requires being very careful never to sum two values that 
should not be summed. I don't believe this is something that should be done 
later (especially when we're not sure it can be done later in a satisfactory 
way).
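
For illustration, a rough sketch of those two merge modes (again made up, not 
our actual code; the isDelta flag stands in for however the real context 
distinguishes local deltas from replicated totals):

{code:java}
// Hedged sketch, not the actual implementation: delta values are summed,
// everything else falls back to timestamp reconciliation. Correctness hinges
// entirely on isDelta being right; mark a replicated total as a delta and
// you double-count.
final class CounterValue
{
    final long count;
    final long timestamp;
    final boolean isDelta; // true only for increments this node accepted locally

    CounterValue(long count, long timestamp, boolean isDelta)
    {
        this.count = count;
        this.timestamp = timestamp;
        this.isDelta = isDelta;
    }

    static CounterValue merge(CounterValue a, CounterValue b)
    {
        if (a.isDelta && b.isDelta)
            // two local increments: summing them is the whole point of the design
            return new CounterValue(a.count + b.count,
                                    Math.max(a.timestamp, b.timestamp), true);
        // at least one is a replicated total: summing would double-count,
        // so reconcile the usual way, highest timestamp wins
        return a.timestamp >= b.timestamp ? a : b;
    }
}
{code}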

{quote}
3. To resolve this issue we have borrowed the implementation from 
CASSANDRA-1546 (with the added deadlock fix).
{quote}

Cool and thanks for the deadlock fix.

> Increment counters
> ------------------
>
>                 Key: CASSANDRA-1072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072.patch, increment_test.py, 
> Partitionedcountersdesigndoc.pdf
>
>
> Break out the increment counters out of CASSANDRA-580. Classes are shared 
> between the two features but without the plain version vector code the 
> changeset becomes smaller and more manageable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
