[
https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932917#action_12932917
]
Sylvain Lebresne commented on CASSANDRA-1072:
---------------------------------------------
Thanks for the answers. I haven't had the time to look at the new patch yet,
sorry, but I will as soon as time permits. A few answers to your answers though.
{quote}
1. Is this the kind of IP address situation you are referring to? A cluster of
nodes: A (127.0.0.1), B (127.0.0.2), and C (127.0.0.3) have been running and
are not fully consistent. They're brought back up w/ shuffled ips, like so: A
(127.0.0.2), B (127.0.0.3), and C (127.0.0.1). A has the most up-to-date view
of writes to 127.0.0.1; however, C is now in charge of writes to 127.0.0.1,
i.e. any writes to A that C had not previously seen have now been lost.
{quote}
That's one scenario, but I think you'd actually be very lucky if, in such a
scenario, you only "lose a few non-replicated updates". It can get much (much)
worse.
Suppose you have your 3-node cluster (and say RF=2 or 3). Node A accepts one
or more counter updates and its part of the counter is, say, 10. This value 10
is replicated to B (as part of "repair on write" or read repair). On B, the
memtable is flushed, so this value 10 ends up in one of B's sstables. Now A
accepts more updates, bringing the value to, say, 15. Again, this value 15 is
replicated. At this point, the cluster is coherent and the value for the
counter is 15. But say the cluster is somehow shut down, there is some IP
mixup, and B is restarted with the IP that A had before. Now, any read (on B)
will reconcile the two values 10 and 15 by merging (summing) them, because B
now believes these are updates it accepted itself, and as such deltas, while
they are not, and yield 25. Very quickly, replication will pollute every other
node in the cluster with this bogus value, and compaction will make it
permanent.
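To make the hazard concrete, here is a minimal, hypothetical sketch of that
merge rule (the names CounterShard and reconcile are invented for
illustration, this is not the actual patch code): a node sums two values only
when it believes it authored both of them, and otherwise keeps the one with
the highest clock.
{code:java}
// Hypothetical sketch of the delta-merge hazard; not the patch's code.
public class CounterMergeHazard
{
    /** One node's contribution to a counter: (owner, logical clock, value). */
    record CounterShard(String owner, long clock, long value) {}

    /**
     * Reconcile two shards from the point of view of node `self`.
     * If `self` believes it authored both shards, they are treated as
     * deltas and summed; otherwise the shard with the higher clock wins
     * (normal timestamp-style reconciliation).
     */
    static CounterShard reconcile(String self, CounterShard a, CounterShard b)
    {
        if (a.owner().equals(self) && b.owner().equals(self))
            return new CounterShard(self,
                                    Math.max(a.clock(), b.clock()),
                                    a.value() + b.value()); // "my deltas": sum
        return a.clock() >= b.clock() ? a : b; // replicated totals: latest wins
    }

    public static void main(String[] args)
    {
        // A's part of the counter, replicated to B at two different times:
        // 10 (already flushed to an sstable on B), then 15.
        CounterShard v1 = new CounterShard("127.0.0.1", 1, 10);
        CounterShard v2 = new CounterShard("127.0.0.1", 2, 15);

        // B reconciling as itself (127.0.0.2): correct, 15 wins.
        System.out.println(reconcile("127.0.0.2", v1, v2).value()); // 15

        // After the IP mixup, B restarts as 127.0.0.1 and treats both
        // totals as its own deltas: 10 + 15 = 25, the bogus value.
        System.out.println(reconcile("127.0.0.1", v1, v2).value()); // 25
    }
}
{code}
Same two values, same merge function; the only thing that changed between the
two calls is which IP the node believes is its own.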
Potentially, any change of a node's IP to one that has been used by another
node at some point (even a decommissioned one) can be harmful (and
dramatically so), unless you know that everything has been compacted nice and
clean.
So while I agree that such IP changes are not supposed to be the norm, they
can, and so they will, happen (even in test environments, where one could be
less prudent and such scenarios are thus even more likely to happen; it will
piss people off real bad).
I'm strongly opposed (and always will be) to any change to Cassandra that will
destroy data because someone on the ops team has messed up and hit the enter
key a bit too quickly. But that's just my humble opinion and it's open source,
so anybody else, please chime in and give yours.
{quote}
A fix with UUIDs is possible but it's beyond the scope of this jira.
{quote}
Because of what's above, I disagree with this. Even more so because I'm not at
all convinced that this could be easily fixed afterwards.
{quote}
2. Valid issue, but it does sound like something of an edge case. For a first
version of 1072 it seems reasonable that instructions for ops would be
sufficient for this problem. If the community then still feels it's a problem
we can look at how to improve the code.
{quote}
Not sure that's an edge case. Right now, when a node is bootstrapped, repair is
not run automatically at the end of the bootstrap, in part because most
failures happen quickly. Thus good advice is to wait a bit, to make sure the
new node behaves alright before running repair on the other nodes, so you have
a quick rollback if the new node doesn't behave correctly. Bootstrap followed
by decommission seems to me bound to happen from time to time (if someone
feels like confirming/denying?). That repair has not been run when this
happens doesn't seem a crazy scenario at all either. And anyway, as for 1, the
risk is corrupting data (for the same reason: a node will merge values that
are not deltas). I don't consider "telling people to be careful" a fix. And
because I don't think fixing this will be easy, I'm not comfortable with
deferring it.
More generally, the counter design is based on some values being merged
(summed) together (the deltas) and others being reconciled as usual based on
timestamps. This is a double-edged sword: it allows for quite nice performance
properties, but it requires being very careful never to sum two values that
should not be summed. I don't believe this is something that should be done
later (especially when we're not sure it can be done later in a satisfactory
way).
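And for what it's worth, a small self-contained sketch of that two-mode
design, under the same caveat (names invented, not the patch's code; a counter
is pictured as a map from owner node to a (clock, value) shard):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two-mode reconciliation; not the patch's code.
public class CounterContextSketch
{
    record Shard(long clock, long value) {}

    /** Merge two copies of a counter, from node `self`'s point of view. */
    static Map<String, Shard> merge(String self, Map<String, Shard> a, Map<String, Shard> b)
    {
        Map<String, Shard> out = new HashMap<>(a);
        b.forEach((owner, shard) -> out.merge(owner, shard, (x, y) ->
            owner.equals(self)
                // own shards are deltas: summing is correct ONLY if both
                // really were produced locally
                ? new Shard(Math.max(x.clock(), y.clock()), x.value() + y.value())
                // everyone else's shards are totals: highest clock wins
                : (x.clock() >= y.clock() ? x : y)));
        return out;
    }

    /** The counter's value: the sum of each owner's (reconciled) part. */
    static long total(Map<String, Shard> counter)
    {
        return counter.values().stream().mapToLong(Shard::value).sum();
    }

    public static void main(String[] args)
    {
        Map<String, Shard> copy1 = Map.of("A", new Shard(1, 10), "C", new Shard(1, 4));
        Map<String, Shard> copy2 = Map.of("A", new Shard(2, 15));
        // Merged on node B: A's part reconciles to 15, C's stays 4 -> 19.
        System.out.println(total(merge("B", copy1, copy2)));
    }
}
{code}
The dangerous branch is the summing one: it is only correct under the
assumption that both shards really are locally produced deltas, and that is
exactly the assumption an IP mixup silently breaks.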
{quote}
3. To resolve this issue we have borrowed the implementation from
CASSANDRA-1546 (with the added deadlock fix).
{quote}
Cool and thanks for the deadlock fix.
> Increment counters
> ------------------
>
> Key: CASSANDRA-1072
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
> Project: Cassandra
> Issue Type: Sub-task
> Components: Core
> Reporter: Johan Oskarsson
> Assignee: Kelvin Kakugawa
> Attachments: CASSANDRA-1072.patch, increment_test.py,
> Partitionedcountersdesigndoc.pdf
>
>
> Break the increment counters out of CASSANDRA-580. Classes are shared
> between the two features, but without the plain version vector code the
> changeset becomes smaller and more manageable.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.